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ABSTRACT 

In block-based video coding, the current frame to be encoded is decomposed into blocks of the same size, and 
a motion vector is used to improve the prediction for each block. The motion vectors and the difference frame, 
which contains the blocks* prediction errors, must be encoded with bits. Typically, choosing a smaller block size 
(using more motion vectors) will improve the prediction and hence decrease the number of difference frame bits, 
but it will increase the number of motion bits since more motion vectors need to be encoded. Not surprisingly, 
there must be some value for the block size that optimizes the tradeoff between motion and difference frame bits,' 
and thus minimizes their sum (the total number of bits for the frame). Despite the widespread experience with 
block-based video coders, there is little analysis or theory that quantitatively explains the effect of block size on 
encoding bit rate, and ordinarily the block size for a coder is chosen based on empirical experiments on video 
sequences of interest. In this work, we derive a procedure to determine the optimal block size that minimizes 
the encoding rate for a typical block-based video coder. To do this, we analytically model the effect of block 
size and derive expressions for the encoding rates for both motion vectors and difference frames, as functions of 
block size. Minimizing these expressions leads to a simple formula that indicates how to choose the block size in 
these types of coders. This formula also shows that the best block size is a function of the accuracy with which 
the motion vectors are encoded and several parameters related to key characteristics of the video scene, such as 
image texture, motion activity, interframe noise, and coding distortion. We implement the video coder and use 
our analysis to optimize and explain its performance on real video frames. 

Keywords: video coding, motion estimation, motion field resolution, block size, motion vector accuracy. 



1 INTRODUCTION 

In block-based, motion-compensated video coding, a motion field with one motion vector per image block is 
used to improve the prediction of the frame to be encoded. These motion vectors must themselves be encoded 
with bits. The benefit of this strategy is that the sum of motion bits plus those used to encode the motion- 
compensated difference frame is often significantly less than the number of bits needed to encode the difference 
frame without motion compensation. 
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The number of motion bits depends on the number of motion vectors and the accuracies with which they are 
encoded. Increasing either will improve the prediction and hence decrease the number of bits needed to encode 
the difference frame. Up to a point, it will also decrease the total number of bits (i.e., the motion bits plus the 
difference frame bits). 

In this paper we focus on the classical coding scheme where in each frame all blocks have the same size of 
BxB pixels (e.g., 4x4, 8x8, 16 x 16) and all motion vectors are encoded with the same pixel accuracy A (e.g., 
A ss 1, 1/2, i/4),i-3. 12,13,16,17 In co d er8} a jsg Ue i s t 0 fi nc j the block size B* and the motion vector 
accuracy A* that minimize the total encoding rate (in bits/pixel) for a given level of distortion. 

Ordinarily, the choice of B and A is based on heuristics or empirical experiments on scenes of interest. For 
example, most block-based video coders use pixel accurate (A = 1) or half-pixel accurate (A = 1/2) motion 
vectors. 1 "" 4,12 ' 13 MPEG 2 and H.261 3 use block size B = 16. Recently, it was found that B = 2 was best for 
ultrasound video, 13 and that, with the appropriate motion coder, B = 8 was better than B = 16 for some typical 
video scenes. 12 In H.263, 4 there is the option of splitting 16 x 16 blocks into four 8x8 subblocks. But despite 
the widespread experience with block-based video coders, there has been little analysis or theory that predicts 
or explains how the best choices of B and A depend on the key characteristics of the video scene, such as image 
texture, motion activity, interframe noise, and coding distortion. 

There has been a significant amount of work on finding methods to design object oriented or variable- block size 
motion fields for individual frames. B ~ 7, 9-11 Though generally these methods could be applied to the fixed block 
size motion fields considered in this paper, they are fairly complex when based on rate-distortion optimizations 6,11 
and suboptimal when based on simple ad hoc criteria or greedy optimizations. 5 » 7 » 9 » 10 Further, they do not provide 
analysis or theory that explains or predicts how the best block size depends on motion accuracy and the key scene 
characteristics. 

The problem of finding the effect of the motion vector accuracy A on the difference frame was first considered 
by Girod. 8 Recent work of the present authors 16,17 focused on finding the optimal accuracy A* for a fixed block 
size B in the context of certain simple, appropriate models. However, neither the functional dependence of A* 
on B, nor the optimal block size B* , were addressed. 

In this paper, we extend our previous work 16,17 to find the optimal block size B* as a function of A in the 
context of similar models. Also, we derive an expression for the optimal motion vector accuracy A* as a function 
of B. Specifically, we consider a simple block-based video coder that uses scalar quantization with entropy coding 
of the difference frame and lossless differential encoding of the motion vectors. We find models for the rates of 
these coders as functions of block size B and motion accuracy A that are simple enough that we are able to derive 
formulas for the optimal B* and A*. These formulas show the effects of the key characteristics of the video scene 
on B* and A*. For example, they indicate that the block size should be smaller for higher textured frames and 
when the motion vectors are encoded with low accuracy. On the other hand, the block size should be larger for 
larger coding distortions and when the variance of the motion vectors is small or the motion vector correlation is 
large. 

Finally, we implement the video coder and use these formulas to optimize and explain its performance on 
real video frames. The experimental results suggest that our equations are accurate and that significant bit rate 
savings can be achieved with our optimization procedures. 



2 THE BLOCK-BASED VIDEO CODER 

In a typical block-based video coder a prediction for the frame to be encoded is formed by subdividing the 
frame into blocks (e.g., 8x8 pixels per block), finding the motion vector for each block that displaces it to match 



a block in the previous frame 77 , and then displacing the blocks in the decoded previous frame according to the 
motion vectors. Subtracting the predicted frame from the actual frame yields a difference frame. 

In our coder, the difference frame pixels are encoded by a uniform scalar quantizer with quantization step size 
Q followed by a first-order entropy coder where the code used is adapted to that particular difference frame. Such 
a coder is adequate for this study because of the low correlation between pixel values in the difference frame. 8,7 
However, as in our previous work, 16,17 it is anticipated that the approach followed here could be extended to 
more advanced coders and that the general nature of the conclusions found here remains relevant. 

The motion vectors are computed with accuracy A and are differentially encoded (DPCM) with lossless 
entropy coding. This technique is popular in block-based video coders since it has been shown to perform well in 
practice. 1 " 4,14 It is important to realize that the effect of B on the coder performance depends on the choice of 
motion vector coder. For example, observe that a motion vector coder that encodes four motion vectors produced 
for B — 8 in a lossy fashion may be equivalent to a motion vector coder that losslessly encodes motion vectors 
produced for B = 16 (c.f. Sipitca and Madisetti 12 ), and these two coders will clearly have different optimal block 
sizes. Because of this, the equations and results in this work are only valid for the typical differential encoding 
of motion vectors. Nevertheless, other types of motion coders could also be considered within our framework by 
simply modifying the motion rate model. 



2.1 Rate for the difference frame R D 



The expected rate produced by our difference frame coder when encoding a particular difference frame is 
Rd « H(Q)) where H(Q) is the empirical entropy of the Q-quantized difference frame pixel values. We assume 
that the difference frame pixels have a discrete Laplacian form, 7 with variance a 2 and use the approximate 
formula 16 - 17 : 



H(Q) « 



log 2 V2ea - log 2 Q, c 2 > ^ (low distortion) 



Q 3 In 2 



2e 

cr 2 , a 2 < (high distortion). 



Note that <r 2 will be larger than Q 2 /(2e) for small values of Q (low distortion), and smaller for large values of Q 
(high distortion). The approximation in (1) for low distortion is obtained by relating the entropy H(Q) to the 
differentia] entropy of a Laplacian, 16 and for high distortion by choosing the linear function of a 2 to meet the log 
function at a point of equal slope, which turns out to be at cr 2 = Q 2 /(2e). From now on, we focus on the more 
interesting high distortion case (Q 2 > 2e<r 2 ), which typically corresponds to PSNR's of 40 dB or less. 

As our estimate of cr 2 we choose: 

P/B* 

p £ *. m 

where is the energy of the tth block in the difference frame, and P and P/B 2 are the total number of pixels 
and blocks in a frame, respectively. It has recently been shown 16,17 that for pixel or subpixel motion vector 
accuracies, i.e., A < 1, can be approximated by the quadratic formula: 

Ei w^A 2 -*- Ci, (3) 

where 

A ' = ^2 £ {{Fi*,y)-F(* + i,v)) 2 + (n*,v)-F{*,y + i)) 2 ), (4) 

(x,t/)tBlock i 



Our analyais can also apply to backward or intcrpolative prediction. 



with F(x y y) the intensity value of the pixel at coordinates (ar, y) in the current frame F. Note that Ai is a simple 
measure of the texture in the j'th block, and d is the energy in the ith block of the difference frame if the block's 
motion vector was found with infinite accuracy (A = 0). Clearly, Q would be zero if an exact match of the tth 
block could be found in the previous frame, but in practice there is interframe noise and other phenomena and 
C{ > 0. Specifically, we model C, as follows: 

Q 2 

Ci = C,,o + YjB 2 + S it (5) 

where Ci.o is the interframe noise energy in the block (e.g., due to light changes, occlusions, camera noise, etc.), 
Q 2 /12 is the approximate coding distortion introduced in the encoded previous frame, and Si is the error energy 
caused by having several pixel motions in a block (e.g., due to rotations, several moving objects in a block, etc.). 
Si would be zero if all the pixels in the block moved with the same velocity, but in practice 5, is approximately 
zero only for small block size B and increases with larger B. In fact, in Section 6.1 of the Appendix we show 
that assuming a separable, first-order autoregressive model for the ideal motion vector field of F, Si can be 
approximated by 

Si « 6^1n(l/a)yl,£, (6) 
where a 2 / and a are the variance and correlation coefficient of components of the ideal motion vectors (same for 
the X and Y components), respectively. A fairly accurate estimate of a 2 , can be computed from the empirically 
measured motion vectors at each block, but an accurate estimate of a requires finding the ideal motion vector at 
every pixel, also known as optical flow, which is very complex (c.f., Barron et al lB ). In practice, a can be fixed 
for a large variety of scenes, and based on the experiments in Guillotel and Chevance's work, 14 we fix a = 0.998 7 . 
Combining (1), (2), (3), (4), (5), and (6) yields the following approximation to the total difference frame rate R D 
for the low distortion case: 

R o^Q^{^T A +&<TlH\/a)T A B + ^ + ^, (7) 

where 

1 P,B * 1 

T -< = p £ * = Top £ ((^(«, ») - F(« + 1. y))» + (/!■(«, y) - F(«. y + (8) 
is a measure of the texture (per pixel) in the current frame F (independent of the block size value B), and 

H=pE C °.i 0) 

1 = 1 

is the average interframe noise (per pixel). Observe that /i is not a function B % because recall that the Co/s are 
noise energies caused by light changes, occlusions, etc., which will add to the same total /xP regardless of the 
value of block size. (The value of fi will not be needed.) 



2.2 Motion vector rate R M 

The motion vectors are losslessly encoded by DPCM with entropy coding. Let n ( - = (n Zti} n yii ) be the DPCM 
prediction error for the motion vector at the tth block. Note that the n x /s and n yti } s are integer multiples of A, 
the pixel accuracy with which the motion vectors are found. We model the {n Xiii n Vti )?lf values as A-quantized 
outcomes of a continuous Laplacian random variable H with variance /?^. Then, the total expected rate invested 
in motion is: 

P/B* 

p E ( 2 M^)-log 2 A 2 ), (10) 



that the correlation coefficient of the ideal motion vectors is significantly higher than that of the pixel values - typically 0.9 



with 

hW) = -log 2 2e 2 fo, (11) 

the differential entropy of N. The approximation in (10) is obtained, like the low distortion case in (1), by relating 
the entropy of the n X| »'s and n y /s to the differential entropy of a Laplacian. 15 

Clearly, the prediction error variance will increase with B because the correlation of adjacent motion 
vectors decreases when the vectors are further apart. In fact, in Section 6.2 of the Appendix we show that 
assuming a separable, first-order autoregressive model for the computed motion vectors and neglecting small 
terms: 

/&« 2<T 2 v ]n(\/a)B ) (12) 

where cr v and a are the average variance and correlation coefficient, respectively, measured for the X and Y 
components of the computed motion vectors. It is important to observe that the ideal motion field (which we 
previously modeled also with an AR process) and the field of computed motion vectors have different statistics. 
In practice, our final results are not sensitive to the differences between the variance of the ideal motion field 
<r v and that of the computed field <r v> and we set <r v = a v . But the correlation factor a of the ideal motion 
field (which we fixed to 0.998 based on accurate motion estimates 14 ) may be significantly larger than that of 
the computed field (which we find with classical block matching 1 ). Because of this, we will estimate a from the 
computed motion vectors, and since we will not compute the motion vectors at every pixel but at every block, 
we will use equation (23) with the appropriate value of B. 

Now, combining (10), (11), and (12), we obtain an expression for the motion vector rate: 



P( B 

*«*^£ 0og 2 (4e 2 4 ln(l/3)fl) - log 2 A 2 ) = ± log 2 If^Ml^ , (13) 



2.3 Total rate R 

Finally, we sum the motion rate R M (10) and the difference frame rate R D (7), and obtain the following 
approximation to the total rate for encoding the current frame: 

R = Rd 4- Rm 
e 



(AT, + 6 * Wla)T A B + g + „) + £ log, (M) 



Observe that (14) indicates the effect of block size B and other interesting parameters on the encoding rate 
R. As expected, the difference frame rate Rd increases with larger B, because the difference frame energy due 
to non-uniform motion, "6 <r\ In (1 /a) T A B" } increases with B. Also, the motion rate Rm typically increases with 
smaller B, since reducing the block size increases the number of motion vectors, as indicated by "1/B 2 ". However, 
reducing B increases the correlation between the motion vectors and fewer bits per vector are needed, as shown by 
"log 2 (4eV 2 / l n (l/5)B)/A 2 ". In fact, note that if V 2 ln(l/5)" was very small (i.e., very large motion coherence), 
the correlation effect would dominate and reducing block size would also reduce Rm (e.g., a = 0.99, A = 1, and 
u\ < 1). 

Not surprisingly, decreasing A (higher motion vector accuracy) increases I he motion rate R M , but decreases 
the difference frame rate R D since the prediction for the current frame is improved with more accurate motion 
vectors. Finally, R D increases with higher frame texture T A , coding distortion Q (or Q 7 /\2 in MSE), and 
interframe noise /i. 



3 OPTIMIZATION 



3.1 The optimal motion vector accuracy A* 

For a fixed block size B and distortion Q } we find that the motion vector accuracy A* that minimizes the 
total frame rate R (14) is 




where L = y/l2/e is a constant and D = Q 2 /\2 is the approximate MSE distortion of the encoded frame. 
Equation (15) indicates that the motion vectors must be encoded more accurately (i.e., A smaller) for larger 
block size. Also, higher motion accuracy is necessary in frames with larger texture Ta, and lower accuracy is 
needed at higher levels of compression (higher D). 



3.2 The optimal block size B* 



For a fixed motion accuracy A and distortion Q, we find that the block size B* that minimizes R (14) satisfies 

A'i£* 3 = 21og 2 B* + A' 2j 

where 

T A a\ ln(l/o) e „ ftl <rl\n{\/a) 

A j = V D V — , K 2 = 2 log 2 ^ 2 + 4 + 3 log 2 e . (16) 

The dependence of B* on A, Ta, D, v\ % a and a is explained in detail in the next section. 



4 EXPERIMENTAL RESULTS 



We present results of the performance of our video coder and of our formulas on the well-known video sequence 
"caltrain" (a scene with a moving train and a calendar) with quantizer step size Q=20 (the PSNR is approximately 
33 dB). The frame resolution was 512 x 400 pixels and only the gray-level components were encoded. We used 
the full-search block-matching method with the minimum absolute difference criteria 1 to compute the motion 
vectors with pixel and subpixel accuracy for A's in the set = {2, 1, 1/2, 1/4,..., 1/32} (subpixel accuracy was 
computed using bilinear interpolation on the pixel position with the best match), and the maximum interframe 
displacement was 8 pixels. The first frame was intracoded by a simple uniform scalar quantizer with quantization 
step Q and each of the following frames were predicted by their respective (encoded) previous frames. 

In Figure 1, the continuous line shows the total empirical rate R for encoding the second frame of "caltrain" 
using block size B = 16 (16 x 16 pixels/block) for different values of A. The two dashed lines show R(A) but for 
B = 4 and B = 8. The V signs on the A axis show our prediction A* (15) for where the minimum of R occurs. 
From left to right, the "o"'s correspond to the B = 4, B = 8, and B = 16 rate curves, respectively. The V signs 
plot the rate obtained when the A^s are rounded to the nearest A in 

In Figure 2, the continuous line corresponds to "tfjB 3 " (see (16)) for the second frame of "caltrain". We 
measured T A = 30.732, and estimated u\ = 4.095 and a = 0.837 from the computed motion vectors with B = 4 
and A = 1/4 (a = 0.998). The dashed lines correspond to "21og 2 B + /\ 2 " for different values of A. The points 
where the dashed lines cross the continuous line are the values B* for the respective A^. Observe from this plot 
and from (16) that the best block sizes B* decrease with larger frame texture T A and smaller distortion D (larger 
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Figure 1: Total empirical rate R vs. motion vector accuracy A, for fixed block size B = 4 (4 x 4 pixels per block), 
B = 8 and B = 16. The V signs on the A axis show our prediction A* (15) for where the minimum of R occurs. 
From left to right, the "o"'s correspond to the B = 4, B = 8, and B = 16 rate curves, respectively. The V signs 
plot the rate obtained with our optimal motion accuracies for each B case (when the A*'s are rounded to the 
nearest A in S& ). 
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Figure 2: A"j B 3 (solid line) vs. 2 log 2 B + K 2 (dashed lines), for different motion vector accuracies A (see equation 
(16)). The points where the dashed lines cross the continuous line are the optimal block size values B* for the 
respective A's. 
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Figure 3: Optimal block size B* vs. motion accuracy A for "cal train" video sequence. 



Ai), and that the B*'s increase with higher motion vector accuracy (larger A'2). Also, note from (16) that large 
motion coherence (small cr v and large a, a) decreases both K\ and AV Typically, the effect of K\ will dominate 
and the crossings will occur at larger values of B. In other words, the optimal B*'s usually increase for more 
coherent motion fields. But in the rare case that "cr v ln(l/a)/A 2 " is extremely small (e.g., a = 0.99, A = 1, and 
Oy < 1), then the effect of A'2 would dominate and decrease the B* y s. 

In Figure 3, the continuous line shows the optimal block size B* plotted versus motion accuracy A for the 
second frame of "caltrain" . The B* 's (the line crossings in Figure 2) are found using a simple search procedure 
in (16). 

In Figure 4 the continuous line shows the total empirical rate R for encoding the second frame of "caltrain" 
using pixel accurate motion vectors (A=l) for different values of block size B. The dashed, dash-dotted, and 
dotted lines show R(B) but when the motion vectors are found with A = 1/2, A = 1/8, and A = 1/32, 
respectively. The "o" signs on the B axis show our prediction B* for where the minimum of R occurs. From left 
to right, the V"s correspond to the A = 1, A = 1/2, A = 1/8 and A = 1/32 rate curves, respectively. The V 
signs plot the rate obtained with our optimal block size when the B* 's are rounded to the nearest B in the set of 
interest S B = {4,8, 16,32}. 

Figure 5 is like Figure 1 but for the third frame of "caltrain". Results for this and other frames of "caltrain" 
are very similar to those of Figures 1, 2, 3, and 4 because the texture and the motion vector statistics remain 
approximately constant in the entire sequence. 

The results suggest that our equations work well. Our optimal B**s appear to be close to the minimum of the 
rate curves in Figure 4. The exact value of the £*'s that minimize the underlying continuous rate curves is not 
known, but when we round our B* to the closest B in the set of interest 5b, the rate that we obtain is always 
the minimum. Even though block sizes B = 8 and B = 16 perform similarly for "caltrain" with pixel (A = 1) 
and half-pixel (A = 1/2) accurate motion vectors at the considered level of distortion (Figure 1), our method 
indicates that 5 = 8 should be chosen for those lower motion accuracies (recall Figure 4). This information is 
important because smaller B's are preferred to reduce blocking artifacts from the block-based prediction. Finally, 
our optimal A*'s also appear to be close to the minimum of the rate curves in Figures 1 and 5. Again, the exact 
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Figure 4: Total empirical rate R vs. block size B, for fixed motion accuracy A. The V signs on the B axis show 
our prediction B* for where the minimum of R occurs. From left to right, the Vs correspond to the A = 1 
(pixel accurate vectors), A = 1/2, A = 1/8 and A = 1/32 rate curves, respectively. The V signs plot the rate 
obtained with our optimal block size (when the B*'s are rounded to the nearest B in Sb) . 
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Figure 5: Total empirical rate R vs. motion vector accuracy A. Like Figure 1 but for the third frame of "caltrain" 



value of the A's that minimize the underlying continuous rate curves is not known, but when we round our A*'s 
to the closest A in the set of interest S A , the rate that we obtain is either the minimum or very close to the 
minimum. 



5 CONCLUSIONS 



In a classical block-based video coder, our simple techniques find at every frame the optimal block size B* for 
a given motion vector accuracy A, and the optimal A* for a given B. Our formulas and results show the effects of 
the key characteristics of the video scene on B* and A*. Specifically, they indicate that the block size should be 
smaller for higher textured frames and when the motion vectors are encoded with low accuracy. Also, the block 
size should be larger for larger coding distortions, and when there is motion coherence in the frame (i.e., small 
motion vector variance and large motion vector correlation). Consequently, it is not surprising that Docef and 
Smith 13 found B = 2 to be the best block size for ultrasound video, since the structure and texture components 
of those sequences have high textured frames and low motion coherence. Conversely, our equations indicate that 
the motion vectors should be encoded more accurately for larger blocks and when there is higher texture in the 
frame, and less accurately when the level of compression is high. 

Some of the above relationships have been noted by other researchers and have been used heuristically in 
practice. Other relationships are more confusing and not so well understood. For example, it is not obvious 
whether larger motion coherence should lead to decreasing the block size in order that the motion vector coder 
can take advantage of the motion vector correlation, or to larger blocks so there will be fewer motion vectors 
to encode without substantially increasing the difference frame rate. In fact, our equations model both effects 
(c.f. see (14)) and show that the second interpretation is typically the correct one, except in the rare case when 
"ovln(l/a)" is very small (i.e., very large motion coherence). Not surprisingly, Martin et a/. 10 designed fairly 
good variable block-size motion fields for video coders by heuristically joining blocks with coherent motion (similar 
motion vectors) into larger blocks. 

We believe that our analysis is also a good first step toward understanding and deriving simple procedures to 
find the optimal motion resolution for object-based and variable block-size video coders, which is our ongoing work. 
Finally, note that many other block-based video coders use difference frame encoders that are more sophisticated 
than our uniform scalar quantization and entropy coding. For example, MPEG 2 uses a block-by-block DCT, 
which gives better rate-distortion performance and also exploits the response of the human visual system. In 
this work, our coder does not intend to compete with these more advanced video coders but to explore the block 
size problem in a basic case. The rates of those more advanced coders are expected to be lower than ours by a 
constant, but the relative differences between the rate curves modes are expected to remain roughly the same, as 
was found in previous work. 16,17 



6 APPENDIX 



6.1 Prediction error caused by nonuniform motion in a block 

In this section, we model the effect of having different pixel motions in a block on the energy of the block's 
prediction error. To simplify this discussion, we consider one-dimensional, continuous frames. Let P(x) be a 
block of length B (in pixel units) in the current frame and P v (x) the block in the previous frame indicated by 
P's motion vector V. For convenience, assume that P[x) and P v (x) take (intensity) values for x 6 [0,S] and 
are zero outside that interval. Now, let v{x) be the ideal motion vector field for the pixels in P, so that v{x) is 
constant at each pixel, i.e., v(x) = Vi for x e 1). Here, we assume there is no prediction noise other than 



that caused by nonuniform motion, and therefore if all the pixels in P moved with the same velocity, v(x) = V t 
P(x) and P v (x) would bejdentical, P(x) = P v (x). But since typically different pixels have different velocities 
within a block, we model Py as follows: 

Pv(x) = P{x + (V-v(x))). (17) 



Next, we model the motion field v(x) as a realization of a random process V(x) such that V(x) = V] for 
x e + 1), where V{ is the ith variable of a random sequence, and assume that v(x) takes the value of the 
block's motion vector in the middle of the block. Our goal is to find a simple expression for: 

S = E[ ["(P(x) - P(x + (V(B/2) - V(x)))) 2 dx], (18) 

Since v(B/2) - v(x) is small, we expand P(x + {v(B/2) - v{x))) in a Taylor series and obtain: 

S « E[[ B (P'(x)(V(B/2)-V{x))) 2 dx)= I* P' 2 {x)E[{V(BI2)-V(x))*]dx 
Jo Jo 



(19) 



with P'(x) the derivative of P(x). In Section 6.2 we show that assuming a highly-correlated, autoregressive (AR) 
model for the random sequence V{ , 

E[(Vb/2 - Vi) 2 } « 24 ln(l/a)|S/2 - t|, (20) 

where <r\ is the variance of V { and a is the AR coefficient. Combining (19) and (20), and after some manipulations, 
we find the following approximation for S: 

S « ±<r v \n(l/a)B£ ^ P'^a:)^ « ln(l/a)B ^(P(i) ~ P(i + l)) 2 



i=0 



= 6(T 2 v }n{\/a)BA ) (21) 

where A = (1/12) J^(P[i) - + I)) 2 • In the two-dimensional case, we obtain the same result for S but A is 
like (4). 



6.2 The variance of the difference between AR random variables 

Here, we find an approximate expression for the variance of the difference between random variables of a 
zero-mean, first-order autoregressive (AR) process. Specifically, consider the following AR random sequence: 

^ = + (22) 

where Vj is the jth random variable of the sequence, a is the correlation coefficient (|a| < 1), and {n,} is a 
sequence of iid random variables with zero mean. Our goal is to obtain a simple expression for E[(Vj - K,-b) 2 ], 
where B is the distance between random variables Vj and Vj- B in the sequence. It is easy to show that: 

E[(Yl " Vj-b) 2 } = E[V 2 } + E[V 2 _ B ] - 2E[V j V j , B ) = 2a 2 - (23) 

with <r\ the variance of Vj (same for all j). To simplify (23), we expand in i Taylor series: 

a' ' = 1 + Slna + -f - > + ... (2 4) 



If a is close to 1 , observe that a « 1 + B In a, and then 

E[[V S - Vj-b) 2 } w 2tr^J31n(l/a). (25) 

Hence, the variance of the difference between random variables ^ and Vj- B of a highly correlated AR random 
sequence (a close to 1) is approximately linear with B, as expressed in (25). 
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ABSTRACT 

In block-based motion-compensated video coding, a fixed-resolution motion field with one motion vector per 
image block is used to improve the prediction of the frame to be coded. All motion vectors are encoded with 
the same fixed accuracy, typically 1 or 1/2 pixel accuracy. In this work, we explore the benefits of encoding the 
motion vectors with other accuracies, and of encoding different motion vectors with different accuracies within 
the same frame. To do this, we analytically model the effect of motion vector accuracy and derive expressions for 
the encoding rates for both motion vectors and difference frames, in terms of the accuracies. Minimizing these 
expressions leads to simple formulas that indicate how accurately to encode the motion vectors in a classical block- 
based motion-compensated video coder. These formulas also show that the motion vectors must be encoded more 
accurately where more texture is present, and less accurately when there is much interframe noise. We implement 
video coders based on our analysis and present experimental results on real video frames. These results suggest 
that our equations are accurate, and that significant bit rate savings can be achieved when our optimal motion 
vector accuracies are used. 

Keywords: video coding, motion estimation, motion vector accuracy, adaptive motion vector accuracy. 



1 INTRODUCTION 

In motion-compensated video coding, motion vectors are used to improve the prediction of the frame to be 
encoded. 1 " 4 These motion vectors must themselves be encoded with bits. The benefit of such an investment 
is that the sum of the motion rate R M (the number of bits per pixel to encode the motion vectors) plus the 
difference frame rate Rq (the number of bits per pixel to encode the motion-compensated difference frame) can 
be significantly less than the rate (in bits per pixel) required to encode the difference frame obtained without 
motion compensation. The widespread success of motion compensation is testimony to the value of this strategy. 

The motion rate Rm depends on the number of motion vectors (equivalently, the spatial resolution of the 
motion vectors) and the accuracies with which motion vectors are encoded (e.g., to one, one-half, one-quarter 
or one-eighth pixel accuracy). Increasing either will decrease the difference frame energy and, if the coder keeps 
distortion constant, will also decrease the difference frame rate R D and, up to a point, the total encoding rate 
R = Rm + Rd- 



In this paper we focus on quantifying the dependence of the total encoding rate on the accuracies with which 
motion vectors are encoded, assuming distortion is held constant. Specifically, for a given set of motion vectors, 
we find simple approximate analytical expressions for how the motion rate Rm and the difference frame rate R D 
depend on the accuracies. And we use them to optimize the performance of video coders. 

We focus on classical block-based video coders with one motion vector per image block. (The method could 
be applied equally well to variable-block sized video coders.) Each motion vector is encoded by uniform scalar 
quantization of the differences between its x and y components and those of the motion vector of the previous 
block, followed by entropy coding. The quantizer step size determines the accuracy of the motion vector — 
smaller step sizes result in larger accuracy and larger motion rate. We find simple approximate expressions 
for the resulting motion rate R M and the difference frame rate R D . The key step in developing the latter is 
the derivation of a simple quadratic expression for the energy of the resulting difference frame, in terms of the 
quantizer step sizes. Then, assuming that difference frame pixels have a Laplacian distribution, an expression for 
the difference frame rate follows easily for a coder that uses uniform scalar quantization of the difference frame 
pixels followed by entropy coding. (The method is also tested with DCT coding of the difference frames.) 

In most video coders, all motion vectors are encoded with the same accuracy, 1 " 11 and the choice of accuracy — 
typically 1 or 1/2 pixel accuracy 1-5 — and resulting motion rate are based on heuristics and empirical experiments. 
In contrast, the expressions we derive for R M and R D are simple enough that for each video frame an encoder 
can dynamically optimize the choice of the motion vector accuracy (quantizer step size) for each block, so as to 
minimize the total rate for a given level of distortion in the encoded frames. The resulting expressions indicate 
that motion vectors must be encoded more accurately where there is more texture and less accurately where the 
prediction is corrupted by phenomena such as camera noise, occlusions or distortion from the encoding of previous 
frames (when compression is high). Some of these relationships have been previously observed and heuristically 
used in the design of video coders. 5,8,9 ' 11 

Based on our techniques, we implement video coders with two types of difference frame coding: uniform 
scalar quantization with entropy coding (for which our analysis is based) and DCT coding (for which the motion 
accuracies determined for the previous case can also be used). 

Our results include rate-distortion curves for the typical motion vector accuracies (1 and 1/2 pixel) and for 
our optimal accuracies, and rate-accuracy curves for several levels of distortion. These results suggest that our 
equations are accurate, and that significant bit rate savings can be achieved if our optimal motion vector accuracies 
are considered. For example, adapting the motion vector accuracy to individual blocks with our procedure results 
in up to 17 per cent bit rate savings with respect to encoding all motion vectors with the same 1/2 pixel accuracy 
(at a peak SNR of 31 dB). Finally, we discuss potential applications of this work to other video coders and to 
block adaptive motion estimation. 

This work is an extension of previous work of the authors 13 that considered the optimization of motion vector 
accuracies for lossless video coders. There are several key differences. An obvious one is the inclusion of the lossy 
quantizer in the difference frame encoder, which necessitated a change in the expressions for Rp in terms of the 
difference frame energy. Another is the DPCM encoding of the motion vectors, in contrast to the simpler scalar 
quantization in the lossless case, which is not adequate for lossy coding. Due to this, a substantially different 
technique is needed to estimate the difference frame energy. Finally, in the lossy case, the motion rate is a larger 
fraction of the total and more benefit results from the optimization of the motion accuracies. 

Other researchers have also addressed the effect of motion vector accuracy on difference frame energy and total 
rate. Girod 7,8 found expressions for the difference frame energy as a function of motion vector error distribution, 
and the spectrum of the current frame and that of the interframe noise. He reached some interesting conclusions 
from this analysis, but unfortunately it cannot be used to adapt the accuracies on a block-by-block basis, and even 
for entire an frame the resulting expressions are too complex to be used to analytically derive the optimal motion 
vector accuracy. Vandendorpe tt al. 10 presented work similar to that of Girod 8 but for interlaced frames. Ito 
and Farvardin 11 performed a simple analysis of difference frame energy in the context of subband coding, which 
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Figure 1: "Coder 1" 



was sufficient, to conclude that the motion vectors of higher frequency subbands require more accurate encoding. 

In another interesting work, Moscheni et a/. 12 use an entropy formula, similar to ours, to predict the rate of the 
difference frame encoder based on measurements of the difference frame energy, and then decide how to allocate 
motion bits in a hierarchical motion compensation scheme. Aside from the fact that we do not consider hierarchical 
motion compensation, our method differs in that we model the difference frame energy and analytically derive 
motion allocations, whereas they had to find them by trial and error, and we jointly optimize the motion rates, 
whereas they perform a suboptimal one-rate-at-a-time optimization. 

The effect of the spatial resolution of the motion vectors, which is not considered in the present paper, has 
been addressed in some of the previously cited work 5,9,12 and other papers. 15 



2 A BLOCK-BASED VIDEO CODER 



In a typical block-based video coder a prediction for the frame to be encoded is formed by subdividing the 
frame into blocks {Pi} (e.g., 8x8 pixels per block), finding the motion vector V{ for each P» that displaces it to 
match a block in the previous frame*, and then displacing the blocks in the decoded previous frame according 
to the quantized versions of the motion vectors, {K}. (If the components of the motion vectors are not integer 
multiples of At = 1, the pixel spacing, the intensity values for the blocks in the previous frame are obtained by 
interpolation.) Subtracting the predicted frame from the actual frame yields a difference frame. This process is 
illustrated in Figure 1, 

Ordinarily, the motion vectors are found by a simple motion estimation procedure such as block-matching, 1 
which computes the motion vectors {K} with a chosen accuracy, and hence a separate quantization step is not 
actually needed (i.e., VJ = V* for all i). In this study, however, it is assumed that the motion vectors are computed 



t Our Analysis can also apply to backward prediction or interpolative prediction. 



to some very high accuracy Ao (i.e., the components of Vi are integer multiples of Ao, where Ao C At), and 
then quantized to some coarser accuracy A > Ao to obtain the Vi's. 

Figure 1 shows the block diagram for "Coder 1", a simple block-based video coder. In "Coder 1", the difference 
frame pixels are encoded by a uniform scalar quantizer with quantization step Q followed by a first-order entropy 
coder where the code used is adapted to that particular difference frame. (Note that the image pixels take values 
in {0, ...,255}. When 0 = 1, the pixels are losslessly encoded, and when Q > 1, they are lossy encoded.) Such a 
coder is an adequate choice for this study because of the low correlation between the pixel values in the difference 
frame. 7,8,12 Letting pQ{d) denote the frequency of the value d in the Q-quantized difference frame, the expected 
rate (number of bits/pixel) produced when encoding a particular difference frame is 

RDK-Y,VQ{d)\0g 2 V Q {d)±H{P Q ). (1) 
d 



The motion vectors are lossy encoded by DPCM with variable-length coding. This technique is popular in 
coding with a fixed-resolution motion field, since it has been shown to perform well in practice. 2,4,6 In our scheme, 
each motion vector is predicted by the quantized version of the previous vector, in scan order. Let nj = {n Xt i,n Vt i) 
be the prediction error for the motion vector at the ith block, with n Xt i the horizontal or x component of the 
prediction error and n Vi i the vertical or y component. We quantize n Xi i and n 1Jt i using uniform scalar quantizers 
with quantization steps A*,,- and A^j, respectively, and encode the results using, a different entropy coder for 
each value of A. In the classical video coding approach all motion vectors are encoded with the same accuracy 
A = A X) i = A V) i. In our new approach, the A's are permitted to differ from block to block, and A X) i can be 
different from Ay t ». Specifically, there is a collection of such codes {Ca x , C$ a( . . .}, and code Csj is used to encode 
the quantization of n X) i whenever A^j = 6j , and similarly for n yi ,\ Let us assume that the (n^,-, n y i} values are 
outcomes of a random variable M with variance a"^. Then, the total expected rate invested in motion is 

1 N 

Rm* p£(ff(C A . (j )+//(C A|li )) (2) 

where P is the number of pixels per frame, N is the number of blocks per frame, and H(C&) is the entropy of 
the random variable M quantized with A accuracy. For A < ajsf it can be shown 14 that 

HA0-log 2 A, (3) 
where h(Af) is the differential entropy of N . Then, the total expected rate invested in motion is 

1 N 

(2 WO - log 2 A Xti A V)i ) . (4) 

Of course, the A's must also be encoded. However, we will not count the rate for this, because it is usually 
negligible compared to Rm and Rd- Finally, the expected total rate to encode the current frame is 

1 N 

R^Rd + Rm* H(p q ) + pYl ( 2/l M " ,0 S2 A.,< A, ,<) . (5) 

i=i 



Observe that making the A's smaller increases the motion rate Rm but improves the prediction for the current 
frame, which decreases the energy in the difference frame and, if the distortion of the coder is fixed, the difference 
frame rate Ro- Hence, when the A's are small, R M is large and Ro is small, and when they are large, the 
opposite occurs. On the other hand, the A's have little effect on the overall mean squared distortion of "Coder 
1", because this is well approximated by Q 2 /12. This suggests that Q should be fixed to achieve a desired level 
of distortion and that for each frame, the encoder should dynamically choose the A's to minimize the total rate 
R for that frame, as expressed in (5). 



1 



3 THE OPTIMAL MOTION VECTOR ACCURACIES 



To minimize the expression for R in (5) for a given Q, the encoder must be able to predict H(f>q) in terms of 
Q and {A s ,i, A y ,i &z,N, A„,/v}. (Note that the constant value of h(tf) does not affect the minimization.) 

To predict H(pq), we assume that pq(d) has a discrete Laplacian form with variance a 1 and use the approx- 
imate formula 

(log 2 N/2e<7-log 2 Q, <r 2 >£ 
(6) 

Note that a 2 will be larger than Q 2 /(2e) for small values of Q (low distortion), and smaller for large values of 
Q (high distortion). The approximation in (6) for low distortion is obtained, like (3), by relating the entropy of 
Po(d) to the differential entropy of a laplacian, and for high distortion by choosing the linear function of cr 2 that 
makes the derivative of the formula continuous at a 2 = Q 2 /(2e). As our estimate of a 2 we choose 

i = l 

where Ei is the expected value of the energy of the difference between the block P, and the block in the previous 
frame at the location indicated by the quantized version of the ith motion vector Vi. Not surprisingly, it can be 
shown that the expected difference energy Ei is a function of the accuracies (A X)it A 1Jti ) with which Vi is encoded. 
In Sections 6.1 and 6.2 of the Appendix, we show that for high motion vector accuracies (e.g., A*,*, A y>i < At), 
sufficiently large and textured blocks, and in the presence of interframe noise, Ei can be approximated by 

where A it Bi and C» are parameters that depend on Pi and the variance of the interframe noise. 

The encoder estimates A i} Bi and Q as follows. Let £(P it V) denote the energy of the difference between Pi 
and the block of the previous frame corresponding to displacement V. For each (A*,*, Ay,,*) in the set of pairs of 
potential accuracies {(A T , A t ), (At, At/2), (A t , A t /4), (A r , A 0 ), .... (Ay/2, A T ), .... (A 0 , A 0 ) } , the encoder 
computes 

,(P,V j + 0 Ao, fc Ao)),; = ^ I? 0 .»=^f W 

and their average, denoted £i(A Xi i, & y> i). Then the estimate for d is 

d = i (ft(A 0l A 0 ) +£(2Ao,Ao) + £i(A 0 ,2Ao)) , (10) 

and the estimates for Ai and Bi are found by a least-squares fit of the £i(A V| <, A yi i)'s by (8) with d as given in 
(10). Observe that estimating {A it is computationally expensive, but could be combined with the ideal 

motion estimation procedure to reduce its complexity. 

Combining (5), (6), (7) and (8) yields the following approximation to the total rate for encoding the current 
frame 



R 



where 



k 1°S2 ¥■ E.Hi *h + Bitf^ + C) - log 2 Q+j, EL W - log 2 Ax,A y ,,) , <7 2 > £ 

<?nhp EH, + S ( A y 2 ,,. + CO + £ EH, (2W - log 2 A I ,,A !/ , 1 ) , 9*<£ 

■ (11) 

* 2 = ^£ (AiAlj + ftAj i4 + CO £ (12) 



3.1 Modes of operation 

We have considered three different modes of operation for encoding the motion vectors for the current frame: 

Mode 1: All motion vectors are encoded with the same accuracy, as in classical video coders. 1 *" 11 Hence, A^i = 
Aj, (1 = ... - A x> at = Aj,,n = A. 

Mode 2: Different motion vectors may be encoded with different accuracies, but for a given block A^,- = A y ,< = A». 
Mode 3: Different motion vectors may be encoded with different accuracies, and A*,* can differ from A y ,i. 

For Modes 1 and 2, the set of pairs of potential accuracies is restricted to {(Ax, At), (At/2, At/2), (Ao, Ao)}, 
Ci = £(Ao,Ao), and the least-squares fit for A% and Bi becomes one dimensional, which is much less complex. 
We minimized the total frame rate R in (11) for each of the three modes of operation, and derived the optimal 
motion vector accuracies. The results are: 



Optimized Mode 1: 



A* = I 



* 1 = Li (*t£^Y' *Uau>% 
a; = l 2 ( fr^w^+g. ) • otherwise > 



(13) 



where L\ = y/2/(l - 2N/P) and L 2 = \/l2/e are constants, D = Q 2 /\2 is the approximate distortion of the 
encoded frame, and \i = -p YliLi d 



Optimized Mode 2: 



A* = 



A i,2 = t2 ( * , otherwise. 



(14) 



Optimized Mode 3: 





= (ifc)*. 










otherwise. 
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otherwise. 





(15) 



The formulas above show how the optimal motion accuracies depend on the parameters {A{ } Bi f d) and Q, 
and apply when the motion accuracies are smaller than or equal to At- We note that: 



• {Ai,B(} tend to be large for high textured blocks and small for low textured blocks. (This is made more 
clear later in (19) and (20)). And hence our equations indicate that motion vectors of higher textured blocks 
must be encoded more accurately, because A{ and J9, appear in the denominator of the A*'s. 

• The parameter is the estimated variance a 2 of the difference frame in the ideal case where all the motion 
vectors are encoded with infinite accuracy (i.e., the A's are zero). We refer to p as the noise floor of the 
frame because it cannot be reduced by more accurate motion compensation. Note that \x would be zero if 
the motion-compensated prediction could be made identical to the current frame, which is not possible in 
real scenes because of camera noise, light changes, occlusions, distortion, etc. Our equations show that for 
larger values of \i lower motion vector accuracies are needed, because /i appears in the numerator. 

• The effect of D is similar to that of /i, but for large Q, and our equations indicate that lower accuracies are 
necessary at higher levels of compression. 

• Finally, note that the {Ai, values increase with block size, but so do those of {C,} and hence the motion 
vectors of larger blocks must be encoded more accurately only if the increase in block texture is superior 
to the increase in noise floor. In fact, larger blocks may contain objects moving differently and hence the 
noise floor will likely be superior. 



3.2 Generalization to other block-based video coders 

Many other block-based video coders use difference frame encoders that are more sophisticated than the 
uniform scalar quantization and entropy coding of "Coder 1", For example, MPEG 2 uses a block-by-block DCT, 
which gives better rate- distortion performance and also exploits the response of the human visual system. The 
rates of those more advanced coders are expected to be lower than those of "Coder 1" by approximately a constant. 
However, equations (13), (14) and (15) can still be used to find the optimal motion vector accuracies because the 
minimization procedure is not affected by a constant difference. To confirm this, we implemented a second coder 
that uses a block-by-block DCT coder for the difference frame similar to MPEG's. 2 We refer to this coder as 
"Coder 2". 



4 EXPERIMENTAL RESULTS 

We present results of the performance of our two video coders on the well-known, real video sequence "caltrain" 
(a scene with a moving train and a calendar), for different levels of distortion and different modes of operation. We 
used block-matching on 8x8 blocks with Ao = 1/64 and the minimum absolute difference criteria 1 to approximately 
compute the ideal motion vectors. Where needed, bilinear interpolation was used to find the intensity values at 
subpixel locations. Also, the motion vector accuracies predicted by (13), (14), (15), were rounded to the nearest 
A r = 2 -r (r is an integer), and were not allowed to be larger than Ar- 

In "Coder 1", the first frame was encoded by a simple uniform scalar quantizer with quantization step Q and 
its rate is not included in the results. In Figure 2, the continuous lines show the total empirical rate R of "Coder 
1" for encoding 4 frames of "caltrain" using Mode 1 with different values of A. Each line is computed for a 
different value of Q. Prom top to bottom, the continuous lines correspond to <2= 1 (lossless case), and Q= 5,15, 
25 (lossy case). (Q=2b corresponds to a peak SNR of approximately 31 dB.) The "o" signs on the A axis show 
our prediction (13) for the values A* where the minima of the R lines occur. From right to left, the "o" signs 
correspond to the R lines for Q= 1, 5, 15, 25. The "x" signs on each R line indicate the total empirical rate that 
results when A is chosen to be the "o" value rounded to the nearest A r (for Mode 1). The and V signs 
plot (A* avg ,R*) for the optimized Modes 2 and 3, respectively, where R* is the total empirical rate and A* v5 is 
the average of the optimal motion accuracies. From top to bottom, the "+" and signs correspond to cases 
Q= 1,5, 15, 25. 




Figure 3: Rate of "Coder 1" vs. MSE distortion for typical and optimized modes 
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Figure 4: Rate of "Coder 2" vs. A for different modes at 33 dB (Q = 20) 



Figure 3 shows the averaged rate-distortion curves for different modes of "Coder 1" . Distortion is given as 
the mean squared error between the original and compressed frames. The continuous line is the rate-distortion 
curve for Mode 1 with motion vector accuracy A=l, and the dashed line is the rate-distortion curve for Mode 1 
with A=l/2. The "x" signs indicate the rate-distortion values for optimized. Mode 1, and the "+" and V signs 
indicate the rate-distortion values for optimized Modes 2, and 3, respectively. 

Figure 4 is like Figure 2, but for encoding 30 frames of "caltrain" with "Coder 2" at a peak SNR of approxi- 
mately 33 dB (accordingly, we used Q — 20 in our formulas). Every fifth frame was intracoded by a JPEG coder 
at the same distortion and its rate is not included in the results. 



Finally, Figure 5 shows histograms of the A I( i values for optimized Mode 3 in a frame of "caltrain" when Q—\ 
(top) and Q=25 (bottom). 



5 CONCLUSIONS 



Observe that our equations and results show that in our two classical block-based video coders: 



• Lossless and lossy coding are relatively insensitive to the number of motion bits provided that A is small 
enough (because the curves of rate vs. A in Figures 2 and 4 are flat in the vicinity of their minima). 

• The rate for optimized Mode 1 is as low as that of optimized Modes 2 and 3, and hence considering different 
accuracies for different motion vectors did not help much. (This might not always be the case.) 

• Optimized Modes 1, 2, and 3 are consistently superior to Mode 1 with A = 1 and A = 1/2 (the typical 
motion vector accuracy choices of most video coding schemes). The optimized Modes give significant savings 
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Figure 5: Histograms of the A values for "caltrain". Top: 0=1. boiiom: Q = 25 



when Q takes values 1, 5, 15 (e.g., up to 0.7 bits/pixel respect to A=l), and optimized Modes 2 and 3 also 
save significantly for Q=25 (e.g., up to 0.12 bits/pixel or 17 per cent respect to A=l/2). The rate savings 
of the optimized modes are expected to further improve in scenes with higher texture content. 

• Optimized Mode 2 is slightly superior to optimized Mode 3 for large Q t probably because the two-dimensional 
least-squares fit by (8) for Mode 3 is more sensitive to distortion than the one-dimensional least-squares fit 
for Mode 2, 

• In general, the results obtained with "Coder 1" and "Coder 2" suggest that our predictions and assumptions 
are quite accurate. For example, (13) predicts fairly well the value A* where the minimum of the rate for 
Mode 1 occurs, and the optimized Modes 2 and 3 are almost as good or better than Mode 1 for very small 
values of A. On the other hand, for Q=25 in "Coder 1" the prediction (13) and the optimized Mode 1 did 
not work as well, probably because our assumptions are less accurate for very large distortions. 



Our formulas show that high motion vector accuracies are required in scenes with high texture content, and 
lower accuracies are needed in scenes where the prediction is corrupted by camera noise, occlusions, or other 
phenomena, or if the level of compression is high. Also, larger blocks will require higher motion vector accuracies 
if the texture increase in the blocks is superior to the increase of the noise floor. Some of these relationships have 
previously been observed empirically and theoretically, and some heuristic conclusions on the best motion vector 
accuracies have been taken and used in the design of video coders. 5,8,9,11 For example, Girod chooses to encode 
the motion vectors with either 1/2 pixel or 1/4 pixel accuracy. 8,9 Those may be good choices for a variety of 
scenes including "caltrain", but our equations indicate that lower accuracies will likely be a better choice for low 
textured frames encoded at large distortion, and higher accuracies will be better for high textured images at low 
distortion. 

Our work can easily be applied to other coding schemes. For example, an object-based video coder using vari- 
able block-size motion compensation could use (13), (14) and (15) to find the optimized motion vector accuracies 
and reduce the bit rate. Also, our techniques have potential applications to other aspects of motion-compensated 
video coding. For example, when coding at low rates a variation of our procedure could be used to control the 
resolution of a motion estimation algorithm on a block-by-block basis in order to save a great number of oper- 
ations, as well as to reduce the total bit rate. To see this, note from Figure 5 that for 0=25 almost all motion 



vectors are encoded with A= 1, 1/2, and 1/4 accuracy, and from Figure 2 the total rate for optimized Mode 3 is 
nearly as low as that of Mode 1 used with much higher motion vector accuracies. 



6 APPENDIX 

We want to find an expression for the expected value E{ of the energy in the ith block of the difference frame, 
as a function of the accuracies (A C| f, with which the ith motion vector is encoded. To do this, in Section 
6.1 we model the energy of the difference between a continuous block and that block displaced by a small offset, 
in terms of the offset. (Only the major steps of this tedious analysis are indicated.) In Section 6.2, we make use 
of such an analysis to find a realistic model for Ei(A Xt i , A y> i). 



6.1 The effect of small displacements on difference energy 

The block energy £ of the difference between a rectangular block V in a continuous image and the same block 
displaced by a small offset (u x ,u y ) is 

€(u X} u y ) = / / (V{x iy )-V(x + u x ,y + u y )fdxd Vi (16) 

Jx-0 Jy-0 

where V{x,y) is the intensity value of the block at the continuous (x } y) coordinates, and T x and T y are the side 
lengths of the block along x and y, respectively. Our model for the intensity values of V is a sum of sinusoids: 

V) = + ]T K m sin(w x , m x + u> v>m y + <£ m )^ b T Sl T y (x> 2/), (17) 

where the Km's, u^m's, u^m's, and <j>m*s represent the amplitudes, frequencies (radians/sec) and phases of the 
sinusoids, and &T r ,T y is the two-dimensional step function of the rectangular base for the block: 

, , . f 1 0<x<T Xt 0< y<T y nA \ 
»t..t,(«.v)=( Q oth ~ erw r (18) 

Combining (16), (17) and (18), assuming that M is greater than or equal to 1, that u x and u y are small compared to 
T x and T v , and that the difference between any two Wj )m 's and u^.m's is larger than 2/T x and 2/T y , respectively,* 
and using simple but tedious manipulations, we find 

£(u Xi u y ) « 2K 7 0 (T x T y -(-\u x \+T x )(-\u y \+T y )) 

M 

- kl(TxT y -cos(w x , m u x +u> y , m u y )(- \u s \+T x )(- \u y \+T y )), (19) 

m=l 

Using a Taylor expansion for the cosine, and neglecting small terms, we obtain 

i(u z ,u y ) « f t.tv x: «s + ( x: ^p^j (20) 

Therefore, the block energy £ of the difference between a block V in a continuous image and the same block 
displaced by a small offset (u X) u y ) follows approximately a quadratic formula: 

£(u xt Uy)KAu 2 x + Bui ( 21 ) 

where A and B are the two quadratic parameters in (20), which depend on the size T x T y and the texture 
parameters {K m } of the block V. 

t A new preliminary analysis indicates that this assumption can be relaxed. 



6.2 The quadratic model for the expected difference energy Ei 



Here let Vi be the underlying continuous version of the block Pi in the current frame to be encoded, and let 
the continuous block V\ be the ideal best match for Vi in the decoded previous frame. We can write 

VK*,v) = 1>i(*,v) + N(x,v), (22) 

where N(x t y) models the differences between Vi and V[ due to illumination changes between frames, camera noise, 
distortion, etc. Let P" be the prediction for P t used by a practical encoder with amotion estimation/quantization 
procedure of (A X( ,-, Aj,,,-) accuracy, and let V" be the continuous version of P". Then, we model 

VV{x,y) = V[{x + u x ,y + u v ), (23) 

where the motion vector errors u x and u y are values in (-A I( »/2, A xi /2) and (~A !/} »/2, A y)i /2), respectively. 
The energy in the ith block of the difference frame is 

* = £ £ - ^"(^.J/p)) 2 • (24) 

x r =Qy p -Q 

where Pi{x p ,xj p ) and P"{x Vl y p ) are the pixel values of blocks P,- and P/', respectively, at the discrete block 
coordinates (x p ,y p ) t and T is the side length of the block. We approximate Si (24) by 

* * / T / T 2/) ~ 1/))' rf* dy. (25) 

Combining (21), (22), (23), and (25) produces the following approximation for the difference energy ft associated 
to Pi 

+ y + u„) + N(x + u x ,y + u y ))) 2 dx <fy 

Jx=Q Jy=0 

= [ f ('Pi(x ) y)^Vi(x^u X} y + u y )) 2 dxdy^Ci{u X) Uy) 

« A u x + ft u 2 + Ci(u„ ti,), (26) 
where Ai and ft are the quadratic parameters for Vi, and 

d(u X) u y ) = / / N(a; + u x ,i/ + u y ) 2 dzdy 

-2 / / //(a: + u x , 2 / + « J ,)(A(a? ) y)-'P < (x + u 3r ,t/ + u I ,)) rfx eft/. (27) 

JxzsO Jy=0 

Finally, we model N(x,y) as a realization of zero-mean stationary noise with variance v 2 N , and u x and tz y as 
the outcomes of two random variables (independent of the N(x,yYs) uniformly distributed in (— A^/2, A x>l /2) 
and (— Ay.i/2, respectively. Combining (26) and (27), and computing the expected value, we obtain the 

desired model for the expected difference energy Ei 

Ei = E[€i] « A + ft + C- = /It A 2 it - + ft A*< + Ci, (28) 

with d = <?%T 2 , A = A/12, and ft = ft/12. The approximation in (28) improves for smaller values of u x and 
u yt and for larger values of T. In practice, this quadratic model for ft appears to work well for a wide variety of 
blocks in real images if u x and u y are smaller than or equal to 2, and T is larger than or equal to 8. 

Remark: The quadratic model (26) appears to match the empirical results obtained by Vandendorpe ei ai. 10 
(Note that d(u Xi u y ) is approximately constant for small values of u x and u y .) 
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Super-resolution enhancement algorithms are used to esti- 1. INTRODUCTION 

mate a high-resolution video still (HRVS) from several low- 

frame enhancement algorithm is presented to compute an t0 esUmate a h, B h - r «olution still from several low-resolu- 

HRVS using the spatial information present within each frame tlon v,deo himes > provided that objects within the image 

as well as the temporal information present due to object motion sc 9 ue nce move with subpixel increments [1]. To properly 

between frames. However, the required subpixel-resolution mo- incorporate temporal correlations into the multiframe ob- 

tion vectors must be estimated from low-resolution and noisy servation model, high-quality subpixel motion vectors must 

video frames, resulting in an inaccurate motion 6eld which can be estimated between video frames [2-4], Methods for 

adversely impact the quality of the enhanced image. Several attaining subpixel accuracy generally employ an interpola- 

subpixe! motion estimation techniques are incorporated into tion of the image sequence frames, followed by the applica- 

the Bayesian muldframe enhancement algorithm to determine tion of a parame tric or nonparametric motion estimation 

the* efficacy in the presence of global data transformations scheme , ^ ac of J moU fi ™ 

between frames (i.e., camera pan, rotation, tilt, and zoom) and ^ a . ' , luu utiU *> 

independent object motion. Visual and quantitative compart U ^^S^S^ ° f hi ^' T ^ iim 

sons of the resulting high-resolution video stills computed from Vl ^f StlU ( HRVS ) un ^ e - 

two video frames and the corresponding estimated motion fields . The conce P l oi> nauldframe enhancement was originally 

show that the eight-parameter projective motion model Is ap- introduced by Tsai and Huang [5], in which an observation 
'^reojmate^oixglebakrecra sequence »<ionKUt mg , mbptnet - r • 

Horn-Schunck optical flow estimation each have their own shifts of the same scene. Stark and Oskoui formulated a 

advantages and disadvantages when used to estimate indepen- projection onto convex sets (POCS) algorithm to compute 

dent object motion, © i»e an estimate from observations obtained by scanning or 

— — rotating an image with respect to the image acquisition 

sensor array. Tekalp el al [6] first extended this POCS 

•This research was supported in part by the National Science Founda- formulation to include sensor noise and then Patti m 

tion Faculty Early Career Development (CAREER) Program, Grant ; ° ™~ "t". ? / 7 * I n °T' u ^ l?i 

J^i?62^ 

...iSJrSkM P?S!S!i!2^SlSl 5S onth P- Bayesian method^ of image . 

. Miftffi ..ttoeird ..Visual ^mmm^m -and . . . segftence frame int^atXon. wii£ mtrodweed. by^O*!^" 

Image Processing *97, (San Jose, CA), February. 1997. man [8] and Schultz and Stevenson [1 j. The multiframe' 
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resolution enhancement algorithm proposed by Schultz 
and Stevenson [1] uses a Bayesian maximum a posteriori 
(MAP) estimation technique which incorporates a discon- 
tinuity-preserving prior to integrate progressively scanned 
video frames. A more general video observation model 
was also defined for the motion-compensated scan conver- 
sion of interlaced image sequences [9j. This research ex- 
tends the Bayesian HRVS algorithm by comparing several 
motion estimation techniques, including the estimation of 
the eight-parameter projective model coefficients [10] t con- 
ventional block matching motion estimation [11, 12], and 
Horn-Schunck optical flow estimation [13], to determine 
the efficacy of these methods with respect to the quality 
of the corresponding multiframe enhanced images. 

The paper is organized as follows. Section 2 describes the 
Bayesian multiframe resolution enhancement technique 
which will be used in the simulations. An overview of three 
motion estimation techniques is provided in Section 3. The 
detection and elimination of inaccurate motion vectors 
may help improve the quality of localized regions within 
a high-resolution video still, and an algorithm used for this 
purpose is also proposed in Section 3. Simulation results 
are presented for two image sequences in Section 4, includ- 
ing a synthetic image sequence with a known subpixel 
camera pan between frames and an actual image sequence 
containing objects which move independently. A brief con- 
clusion and future research directions are provided in Sec- 
tion 5. 

, 2. BAYESIAN MULTIFRAME 
RESOLUTION ENHANCEMENT 

A short low-resolution digital image sequence consisting 
of M frames is defined as 



{y">} for/ = fc- [^ij it -1,*, 

k + l *+ P^]. 



(1) 



where y { *> is the reference frame situated at or near the 
center of the sequence. In this expression, UJ denotes the 
largest integer less than or equal to jc, and fx] represents 
the smallest integer greater than or equal to x. The goal 
of single frame image enhancement techniques — most no- 
tably bilinear [14], cubic B-spline [15], and Bayesian MAP 
interpolation [16] — is to estimate the unknown high-reso- 
- - * ^ 

- 7^-H°weveT;*obje^ 

With ^ 

tional temporal information can be utilized to improve 
the quality of the high-resolution estimate of z ( *>. In this 



section, a motion-compensated subsampling model is de- 
scribed, along with the resulting Bayesian multiframe reso- 
lution enhancement algorithm. 

2.1. Image Sequence Observation Model 

The subsampling model which maps q X q high-resolu- 
tion pixels within frame k, denoted as z&\ into a single 
low-resolution pixel yi*> at spatial location x « (jc, , jt^) is 
given as 



tf-af 2 2 iff) 



(2) 



for x } » 1, . . . , jV, and x 2 « 1, . . . , N 2 . This represents 
the integration of light over a single low- resolution charge- 
coupled device (CCD) image acquisition sensor containing 
q X q high-resolution sensors within its spatial area. In 
matrix-vector notation, this can be expressed as 



y(*) = A<*-*>Z<*>, 



(3) 



where A<* *> € R N i N i*f N \ N i is referred to as the subsampling 
matrix. Intuitively, each row of A (M > maps a square block 
oiqxq high-resolution samples into a single low-resolu- 
tion pixel. 

If the motion vectors are known exactly, the k A frame 
after compensation should be identical to the / lh frame, 
such that 



u<0 = v <*) 



(4) 



where the vertical and horizontal displacement vector com- 
ponents at location x « (j^, x 2 ) are denoted as v(x) and 
h(x), respectively. These displacement values may be frac- 
tional to represent subpixel-resolution motion between 
frames. Assuming that all motion vectors between frames 
k and / are known exactly and that purely translational 
motion can be used to represent the movement of objects 
between the two image sequence frames, the motion-com- 
pensated subsampling model is given as 



, (,i>u t l ;.i uuOa u 



v (5) 



■?( 2 



Therefore, the overall high-resolution to low-resolution 



,'.^•.•.^^.,•.•.•.v,^^■,.,^^^■ l ■,•A., „, 



A"-*> has a structure similar to A ( *«*\ but with the summa- 
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tions over shifted sets of pixels. Since the rows of A (Mt) 
which contain useful information are those for which ele- 
ments of y (,) are observed entirely from motion-compen- 
sated elements of z (k \ low-resolution pixels which are not 
completely observable must be detected so that the corre- 
sponding rows of A (U) can be deleted in the construction 
of the reduced matrix A t(i *\ Practically, the motion vectors 
must be estimated using one of the techniques to be pre- 
sented in Section 3. 

A practical video observation model useful for real im- 
age sequences is given as 



(7) 



where A' (M:) is an estimate of the motion-compensated 
subsampling matrix which both subsamples the high-reso- 
lution frame and takes into account the motion between 
frames, and nP tk * is an additive noise term representing the 
error in estimating A' (,,Ac) from y (lt) and y (/) . Additive noise 
in this case is assumed to be Gaussian-distributed. The 
notation y' (/) denotes only those pixels within y^ which are 
observable entirely from motion-compensated elements of 

z (k) 

22. .High-Resolution Video Stills (HKVS) Algorithm 

To estimate a high-resolution video still from a low- 
resolution image sequence, a Bayesian MAP estimation 
technique [1} is employed. The Bayesian estimate com- 
puted from M low-resolution frames is given as 

t<*> - arg max log fr( 2 <%'<*-U"-i)/zj) y i<*-i )( 

(8) 



)/2l) ). 



Applying Bayes* theorem to the posterior probability re- 
sults in the following optimization problem: 

± ik > = arg max {log Pr(z<*>) + log Pr(y J >'*1> 

* (9) 



sity, and the conditional density models the error in esti- 
mating the motion vectors used to construct the motion 
compensated subsampling matrices A (U) . The motion esti- 
mation error is assumed to be independent between 
frames, so that the joint conditional density may be writ- 
ten as 



Since the signal z {k) and the noise n (U) are statistically 
uncorrected, each of the conditional densities for / 1* k is 
an independent noise probability density function; 



Pr(y'<'>|z<* } ) - Pr(A'<'- A >z<*> + n<'-*V*>) 
- Pr(n<">|z<*>) 
-Pr( n <W). 

Thus, the optimization problem in (9) becomes 



(") 



4<*> «= ar g max 



logPr(z<*>) + logPr(yW|i<* > ) 



+ 2 logPr(y'<'V* > ) 



1 



(12) 



In order to compute the MAP estimate, the prior image 
model Pr(z (i) ) and the conditional densities Pr(y<*>|z<*>) and 
Pr(y'<'y*>) for / * k must be defined. 

The Huber-Markov random field (HMRF) model fl] is 
a Gibbs prior [17] which represents piece-wise smooth 
data, with the probability density function defined as 



Pr(z<*>)»lexp{-^2 



(13) 



In this expression, Z is a normalizing constant known as 
the partition function, is the Gibbs prior "temperature" 
parameter, and c is a local group of pixels contained within 
the set of all image cliques (f. The quantity d^z<*> is a spatial 
activity measure, with a small value in. smooth image loca- 
tions and a large value near edges. Four spatial activity 
measures are computed at each pixel in the high-resolution 
image, given by the following second-order finite differ- 
ences: 



dU w = 0.5 2 <« Url - + 0.5z<*> +Uj+1 . 

The likelihood of discontinuities is controlled by the Huber 
edge penalty function [1], 



= Pr(y(*)|z<*>)X n Pr(y'<V } ). (10) W ' M * 



where a is a threshold parameter which separates the qua- 
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dratic and linear regions. A quadratic edge penalty, lirn^. 
Pa(x) m x 2 , characterizes the Gauss-Markov random field 
prior model The use of the Huber function with the proper 
value of a helps maintain discontinuities in the image es- 
timate. 

Each conditional density, Pr^V**) and ?r(y ,{t) \z lk) ) for 
/ * *, models the error in the displacement vector estimates 
used to construct the motion-compensated subsampling 
matrices A' <a) . Since motion vectors are not required for 
frame y (k) , the density Pr(y< k) \z {k) ) is given as 



p r (yW| X <*>) 



For other frames in the short sequence, the error in the 
video observation model is assumed to be Gaussian-distrib- 
uted; therefore, each conditional density is given by the ex- 
pression 



Pr(y'<'>|z<*>) = 



1 



for / - k - l(M - l)/2l . . . , k - 1 and / = k + 1, . . . , 
k + f(Af - l)/2*l The error variance a***? for each frame 
is assumed to be proportional to the absolute frame index 
difference |/ - k\. 

By substituting Eqs. (13), (15), and (16) into (12), the 
Bayesian MAP estimate of the high-resolution image can 
be written as the constrained optimization problem 



: arg min 



3. SUBP1XEL MOTION ESTIMATION TECHNIQUES 

Estimating accurate subpixel motion vectors is a criti- 
cally important component of modeling an image sequence 
for use in super-resolution enhancement algorithms. The 
eight-parameter projective model coordinate transforma- 
tion [10], block matching [12], and Horn-Schunck optical 
flow motion estimation [13] are described in this section. 
In all cases, two successive low-resolution video frames 
arc first up-sampled by a factor of q using either bilinear, 
cubic B-spline [15], or single frame Bayesian MAP interpo- 
y (A) « A <*.*) 2 <*) lalion l 16 l- The resulting up-sampled frames are denoted 

(15) as y Hk) and y f(/> ; subpixel-rcsolution motion vectors are 
y ( *> * A (k ' k h {k \ then estimated from these two frames. Since a single sub- 
pixel-resolution motion vector is required for each low- 
resolution image pixel in the image sequence observation 
model, the estimated motion vectors are down-sampled by 
averaging over q X q vector blocks. 

5.7. Eight- Parameter Projective Model 

Parametric coordinate transformation algorithms as- 
sume that objects remain stationary while the camera or 

(16) the camera lens moves; this includes transformations such 
as pan, rotation, tilt, and zoom. If an image sequence con- 
tains a global transformation between frames, the esti- 
mated motion field can be highly accurate due to the large 
ratio of observed image pixels to unknown motion model 
parameters. The parametric model which corresponds 
most closely to transformations that occur in the real world 
is the eight-parameter projective model (10]. The model 
is defined as 

_ CM + a 2 x 2 + a 3 , _ gjfi + a*x 2 + at 
flr*i +<fcr 2 +r 2 Qnx, + a*x 2 + 1 ' 

where x - (x, , x 2 ) denotes the spatial location of a pixel 
in frame /, x' « (*{,*£) represents the location of the 

(17) transformed pixel in frame *, and a = , . . . , denotes 
the eight unknown model parameters. This model can be 
expressed in a more compact matrix-vector form as 



X2po(<i;,*<*>) 



where the constraint set is defined as 



•-[:;]- 



As + b 



(20) 



(18) 



Each frame, y' (/) for / & k y has an associated confidence 
parameter,-^ 

in - the ^otion^compeitsated subsamplltig rriiitrrX estirnate^ 
- the 
gradient projection technique [1, 18] is used to compute 
the unique solution l ik \ 



The matrices and vectors have the following dimensions* 
A 6 R 2 * 2 , b G R 2 * 1 , and e € R 2xl . 

The "four point" method proposed by Mann and Picard 
[TO]' is 

^ottetcbefficie^ 

corresponding to four transformed points in the up-sam- 
pled reference frame y » <*>, eight nonlinear equations result 
for the eight unknowns. Solving these simultaneous nonlin- 



42 



SCHULTZ. MENC. AND STEVENSON 



ear equations directly is not possible. To circumvent this 
problem, a bilinear motion model (19] approximation to 
the projective model is used, such that 



(21) 



Now, eight linear equations result for the four selected 
points, resulting in a least squares solution. The projective 
model parameters are then related to the bilinear coeffi- 
cients in an iterative algorithm, in which the up-sampled 
reference frame is iteratively warped into y T(/ ' until the 
projective model parameters converge. 

5.2. Block Matching Motion Estimation 

The primary drawback of using parametric motion mod- 
els is that they are only applicable in the presence of global 
motion. Simply stated, it is expected that they will fail 
when objects move independently. Nonparametric models 
do not possess this problem, so they may be used to esti- 
mate independent object motion [19]. However, due to the 
low ratio of observed image pixels to unknown motion 
model parameters, a number of the estimated motion vec- 
tors may be inaccurate. 

Block matching (12) is a popular approach for estimating 
motion vectors from image sequences. This method as- 
sumes that the motion field is uniform over compact blocks 
of pixels and that the motion can be modeled as displace- 
ments of these blocks. Let a (2p + 1) x (2p + 1) pixel 
region in the up-sampled frame y i f/ > represent the block 
to be matched between frames k and /, with pixel locations 
contained in the set 

5=8 i(r t s)\xi - p r < x, + p\ x 2 - p ^ s s£ x 2 + p}. 

The maximum allowable vertical and horizontal displace- 
ment of this block is d pixels, with this parameter defining 
the extent of the search area in the up-sampled reference 
frame y f Thus, candidate motion vectors are limited to 
the set 

Using this notation, a single displacement vector can be 
estimated using the mean absolute difference (MAD) crite- 
rion as 



Block matching generally performs well if a number of 
spatial discontinuities arc present within the block, but the 
technique fails to estimate vectors properly over flat image 
intensity regions. A large block size, say,p * 5, may be used 
to ensure that a sufficient spatial variation exists within the 
block. 

33. Horn-Schunck Optical Flow Estimation 

Horn-Schunck optical flow estimation [13] results in 
motion vector estimates which satisfy the optical flow equa- 
tion with the minimum pixcl-to-pixel variation in the veloc- 
ity field. Assume that continuous image intensity, denoted 
as £(x, , x 2l f), is constant along a particular object trajec- 
tory, such that the optical flow equation is given as 

,^2,0 = E(x x + kx it x 2 + Ax 2 ,r + A/) Vx, , x 2t r. 

(23) 

Through a first-order Taylor scries expansion of the right- 
hand side of Eq. (23), the optical flow equation can be 
rewritten as 

*oA» (x), Mx)) - v (x)£, t + h<x)£ t2 + £, ~ 0, (24) 

where the translational flow velocity vector consists of spa- 
tial derivatives; i.e., 



(25) 



In the discretized optical flow equation, spatial derivative 
estimates, £ X| and £, 2 , and the temporal derivative esti- 
mate, can be calculated from the interpolated frames 
as finite differences [13]: 



- mSSh ~ x« + - Aft* 



(tf(x), h(x)) - arg min 



mi 



totxi => 1 Ni and* 2 = 1 N 7 . 



+ y!,% - + yi v. - 

The motion field is then estimated using the following cri- 

temv, ...... ..... . ,, v....,,,,..,,.,.......,,........,, 

"""" ^;W«^t;^W)J 



(26) 
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The smoothness measure, e?(u(x), /i(x)), is added to the 
optical flow constraint to ensure thai the displacement 
vectors vary only slightly over a small neighborhood. For 
each point wiihin the image, the smoothness measure is 
defined as 



(27) 



The parameter r 2 controls the influence of the smoothness 
term, and thus the smoothness of the estimated optical flow 
field. Minimization of (26) can be performed by solving a 
pair of linear equations using Gauss-Seidcl iterations [13]. 
Simulations have shown that it takes at least 2000 iterations 
to achieve convergence, even for a pair of relatively small 
video frames. 

Theoretically, Horn-Schunck optical flow estimation 
should yield more accurate motion vectors than block 
matching, since the vectors arc correlated throughout the 
motion field. However, estimating accurate gradients from 
image data is difficult, and for this reason ej/(u(x), h(x)) 
may be a poor estimate of the true constraints, in practice, 
motion fields estimated by applying the Horn-Schunck 
technique to actual image sequence frames contain a num- 
ber of errors, since they are computed by minimizing an 
inaccurate objective function. 

3.4. Detection and Elimination of Inaccurate 
Motion Estimates 

In the multiframe enhancement algorithm, an inaccurate 
motion vector can cause a great deal of damage to the 
estimated high-resolution video still, even more so than 
by assuming that there is no motion at that point. There- 
fore, the detection and elimination of inaccurate motion 
vectors is an important step in the enhancement algorithm 
to improve the overall quality of the high-resolution 
video still. 
v^JDiejiisplac^vf^^ 
tween the up-sampled frame / and the compensated image 
from frame k is denoted as 



The mean, jx DFDf and variance, ctdfd* of the DFD are 
denoted as 



rt" = a^T 2 (i> W> - /W. (30) 

where N is the total numbcr of pixels under consideration. 
Since it is assumed that the motion vectors along the bor- 
ders of each frame are not useful due to pixels entering 
and leaving the scene, N is given as N% X N 2 minus the 
number of excluded border pixels. A motion vector at 
point x is detected as an inaccurate estimate if DFD$ k) 
exceeds a threshold T which is dependent on the signal 
statistics. The threshold value used in the simulations is 
given as 



(31) 



A simple change detection algorithm examines the 
straight frame difference (FD) between the up-sampled 
frames k and /, • 



(32) 



If FD<'*> is small, this implies that the intensity of pixel x 
in frame / is almost the same as the corresponding pixel 
in frame /r. Therefore, the motion vector at this point will 
be difficult to estimate using the block matching and optical 
flow techniques, and it may have to be eliminated from 
the vector field. 

In the simulations, a motion vector estimate at point x 
is considered to be accurate if the following two conditions 
are satisfied: 

DFD« k ><T r FD<! k >>2 

If an inaccurate ^motion vector estimate is detected at loca- 
tion x, v (x) and h(x) are ignored, and the pixel is considered 
JtaJae^oiasj^ahleia 



(28) 



Ideally, the DFD should be zero if the displacement vectors 
describe the rr^^ 
•^ncrrassocialedwTtfrthem^^ 

* v^ a m M TRcrel^TQMi 'DFD* lias a nonzero '^ue^wfiTdG 
can serve as a criterion of how well the motion vectors 
have been estimated. 



4. SIMULATIONS 

In order to compare the effects of the motion estimation 
techniques on the quality of the multiframe enhanced im- 
ages computed by the Bayesian HRVS algorithm, two im- 
age sequences were used in the simulations. The first image 

se^ 

'tally g£n 6 rated " Irorn ' a ' 'single 'aTei tal ' Image * to' m'tf el * "a ' 
diagonal camera pan with known motion vectors. The sec- 
ond image sequence, Mobile and Calendar, is shown in 
Fig. 2. This sequence is composed of several objects moving 
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HO. 1. Airport image sequence: (a) original high-resolution, video frame (b) low-resolution video frame y<*> (q = 2); (c) low- 
resolution video frame y**' 0 (q - 2). 



independently. To show the effect of an estimated motion when more than two frames arc integrated using the Bayes- 

field on the quality of a high-resolution video still, only ian HRVS algorithm. 

these experiments. Thus, a high-resolution video still is factor of q (q = 2 or q - 4) and then interpolated back to 

computed from two low-resolution video frames and the their original dimensions so that quantitative comparisons 

single motion vector field that is estimated between them, could be made using the improved signal-to-noise ratio 

Jt must be emphasized that we are attempting to determine (SNR). The improved SNR is a quantitative measure of 

empirically how the accuracy of an estimated motion field how much the estimated frame S<*> has improved over the 

affects the quality of the high-resolution video still; with reference frame, given as 
this in mind, only a single motion field (and consequently, 

" tv^iow-re-solution^ • • - •"*■■■ ——-•«•■ -| K ttKa^ip--'^-.^v, 

-ments; To obtain imprcrvedtesbhitibri,' more f fames could v ^ st ** ~ **^**jg?(r S ' - •(» ^B)- {33)- 

ous research results by Schultz and Stevenson [1] to exam- In this expression, 4* } is generated by a zero-order hold 

ine the visual and quantitative results that are possible up-sampling of the reference frame y<*\ z<*> is the original 



B es t Availab le C opy 




high-resolution image, and £<*> is the estimated high-resolu- 
tion video still image. It should be noted that this quantita- 
tive measure can only be used in idealized test cases where 
z (k) is known. 

The procedure used to compute the multiframe high- 
•esokrtion^sttnitH^ 

1. Up-sampte both low- resolution frames y (A) and y (i) by 
a factor of q = 2 or q - 4, using (a) bilinear interpolation, 
(b) cubic B-spline interpolation, and (c) single frame 
Bayesian interpolation (M = 1, a * 1.0). The resulting 
up-sampled frames are denoted as y T(ic > and y t(/) . 

2. Estimate the subpixel-resolution motion vectors be- 

- Jective* modet estimation-f ^oot p6i*n 
mating 

flow estimation (t 2 = 10.0, 2000 iterations). Detect and 
eliminate inaccurate motion vector estimates. Finally, 



down-sample the displacement vectors by averaging 
q x q vector blocks. 

3. Compute the Bayesian multiframe HRVS for each 
motion field using the following parameters: M = 2; 
<? = 2org = 4;a = 1.0; and A< u > = 5,0/|/ - k\. 

1 ables 1 and 2 provide a quantitative comparison of the 
single frame and multiframe enhanced estimates computed 
from the Airport and Mobile and Calendar sequences, re- 
spectively, while Figs. 3 and 4 provide a visual comparison 
of several multiframe estimates computed using the three 
different subpixel motion estimation techniques. For all 
multiframe estimates shown in Figs. 3 and 4, single frame 
Bayesian -interpolation* was used' to up<sampte' ' the- X<m- 
resolution frames -so rhat the- subpixer motion -vector -fields * 

couia^^ 

quantitative comparisons are as follows: 
L Provided that the motion is estimated correctly, val- 
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TABLE 1* 

Comparison of Airport Sequence High-Resolution Video Stills 
A jwt (in dB) of Single Frame and Multiframe Resolution Enhanced Estimates 



Sin sic frame ur>jamoline alcorithm/mmmn 
estimation technique 




<?-2 




q m 4 


All 


Accurate 


All 


Accurate 


Single frame bilinear interpolation 
Single frame cubic B-splinc interpolation 
Single frame Bayesian interpolation 




0.93 
3.08 
3.44 




0.59 
1.40 
1.67 


Bilinear inicrpolation/eighi-paramclcr 
Bilinear interpolation/block matching 
Bilinear interpolation/ Horn-Schunck 


6.85 
6.12 
3.08 


7.08 
7.18 
3.17 


2.24 
2.30 
1.79 


2.38 
2,34 
1.85 


Cubic B-spline interrx>laiion/eight-paramctcr 
Cubic B-splinc interpolation/block matching 
Cubic B-spline interpolation/Horn- Schunck 


6.17 
6.11 
2.84 


7.17 
7.14 
3.07 


2.30 
2J4 
1.77 


2.70 
Z38 
1.78 


Bayesian intcrpolation/eight-parameter 
Bayesian interpolation/block matching 
Bayesian interpolation /Horn-Schunck 


7.00 
5.96 
2.82 


7.18 
6.88 
2.96 


2.42 
2.23 
1.57 


152 
2.20 
1.70 



• 14 All" represents the use of all estimated motion vectors, while "Accurate" denotes that the inaccurate 
motion vector estimates have been delected and eliminated prior to the application of the multiframe 
resolution enhancement algorithm. A, w values (in dB) represent the quantitative improvement of the 
high-resolution video still over the low-resolution reference frame. 



ucs of &snr should be larger for the multiframe estimates 
than for the single frame estimates. This is quite evident 
for the multiframe Airport estimates in Table 1 which used 
either the eight-parameter projective model or the block 



matching motion vectors and the Mobile and Calendar 
estimates in Table 2 which used either the block matching 
or Horn-Schunck optical flow motion vectors. 
2. The quality of the subptxel-resolution motion vector 



TABLE 2- 

Comparison of Mobile and Calendar Sequence High-Resolution Video Stills 
Am (in dB) of Single Frame and Multiframe Resolution Enhanced Estimates 



Single frame up-sampling algorithm/motion 
estimation technique 

Single frame bilinear Interpolation 
Single frame cubic B-splinc interpolation 
Single frame Bayesian interpolation 

Bilinear interpolation/eight-parameter 
-^ftTnb^mte^^ 
Bilinear interpoIation/Horn-Schunck 

Cubic B-spline interpolation/eight-parameter 
Cubic B-spline interpolation/block matching 
Cubic B-spline interpolation/Horn-Schunck 

Bayesian interpolation/eight-parameter 
Bayesian interpolation/block matching 
' fiay&iM m'te 



9 = 2 



All 



Accurate 



All 



0.27 
1,70 
2.41 



q ■ 4 

Accurate 

035 
076 
1.12 



-0.13 


-0.04 


1.01 


1.03 










2.74 


2,95 


1.23 


1.23 


0.56 


0.85 


1.08 


1.08 


2.83 


3.07 


1.29 


132 


2.54 


184 


1.16 


148 


0.27 


1.01 


1.10 


1.12 


2.49 


2J86 


1-25 








U5 * 


1.17 



• • • • «P«»eW4«»ei^ 

motion vector estimates have been delected and eliminated prior to the application of the multiframe 
resolution enhancement algorithm. \ s „* values (in dB) represent the quantitative improvement of the 
high-resolution video still over the low-resolution reference frame 
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estimates is dependent upon the accuracy of the up-sam- 
pled image data. Bilinear, cubic B-spline, and single frame 
Bayesian MAP interpolation (M « I, a = 1.0) were used 
to up-samp)e the low-resolution video frames prior to the 
application of a particular motion estimation method. Em- 
pirically, it was determined that both cubic B-spline and 
Bayesian MAP interpolation provide useful and very simi- 
lar up-sampled frames, while bilinear interpolation is 
rather ineffective. 

3. The performance of the three motion estimation algo- 
rithms varies for the test image sequences. In the case of 
global motion, the eight-parameter projective model can 
recover the motion extremely well. As expected, the algo- 
rithm performs poorly on sequences which contain objects 
that move independently. On the other hand, block match- 
ing and Horn-Schunck optical flow estimation perform 
quite well in this case, but not for global data transforma- 
tions. It is expected that the Horn-Schunck algorithm 
should perform belter than block matching for the general 
case of independent object motion. However, as shown 
empirically, this is not always the case. Estimating the 
spatial and temporal gradients as required by the Horn- 
Schunck algorithm results in an inaccurate objective func- 
tion and, consequently, in many inaccurate motion vectors. 

4. In most cases, the multiframe high-resolution images 
and their corresponding A. w * values show that the quality 
of a particular HRVS is improved after inaccurate motion 
vectors have been detected and eliminated. 

5. For the interpolation factor q = 4, Tables 1 and 2 
show that the quality of the multiframe HRVS computed 
using A< = 2 frames does not differ significantly from the 
single frame Bayesian estimate (M ~ 1, a = 1.0). This 
implies that the estimated motion vectors are quite poor for 
q > 2, and that the subpixel-resolution motion estimation 
schemes used in this research do not perform well for 
practical interpolation factors. Parametric techniques can 
be useful for estimating higher-resolution motion fields 
between frames containing global motion (e.£. t reconnais- 
sance and satellite image sequences). Estimating high-reso- 
lution motion fields between frames containing indepen- 
dent object motion is still an open research area, as block 

,^^majcj^ 
adequately. 




FIG. 3. Details of Airport sequence high-resolution video stills, 
q <= 2: (a) low-res frame, (b) low-res frarae, y<*-«>; (c) high-res 
frame, z' \ (d) single frame Bayesian estimate, A«« * 3.44 dB- (e) 
multiframe HRVS, eight-parm model, ail vectors, A 5W? « 7 00 dB* 
) *mu Inf rHme*-TiRV ... 
-?.*8-dB^g)™ulttfr^ 

••■ 5;9«5.dB;- fh)- nwl U^ame V^ - •hiock-^matehrng i '-'accttFate vectorey- 
<W - 6.88 dB; (i) multiframe HRVS, Horn-Schunck, all vectors' 
A™* « 2.82 dB; (j) muldframe HRVS, Horn-Schunck, accurate 
vectors, A JW * - 2.96 dB. 



^ffT^ , : >3i;. jit j. w. 
« •' '-.'4 * . •' J 1 ; 



a.* ^''"ifejic ! 
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»*« 









FIG. 4. Details of Mo6i/e and Calendar sequence high-resolution 
video stills, q = 2: (a) low-res frame, y < * ) ; (b) low-res frame* /* 4l) ; 
(c) high-res frame, (d) single frame Bayesian estimate, b s »* » 
2.41 dB; (c) multiframe HRVS, eight-parm model, all vectors 



&j.va ='027'dB; (f) muliffraroe HRVS", eTght-parm mbo^ accWate* 



vectors, Asm = 1.01 dB; (g) multiframe HRVS, block matching, all 
vectors, A S „ R - 2.49 dB; (h) multiframe HRVS, block matching, 
accurate vectors, & S nk ■ 2.86 dB; (i) mulliframe HRVS, Horn- 
Schunck, all vectors, A**a » 2.58 dB; (j) multiframe HRVS, Hom- 
Schunck, accurate vectors, A w * = 2.79 dB. 



Sfujj^ 

are used to estimate a high-resolution video still from sev- 
eral low-resolution video frames, provided that objects 



within the sequence move with subpixel increments. An im- 
age sequence observation model was presented for low-res- 
olution frames, which models the subsampling of the un- 
known high-resolution data and accounts for independent 
object motion occurring between frames. Three single 
frame enhancement methods were used in the experiments, 
including the bilinear, cubic B-spline, and single frame 
Bayesian interpolation techniques. The Bayesian multi- 
frame enhancement algorithm was described to recon- 
struct a high-resolution video still from several low-resolu- 
tion image sequence frames. In this algorithm, estimating 
accurate subpixel-resolution motion vectors is critically im- 
portant. The eight-parameter projective parametric trans- 
formation, block matching, and Horn-Scbunck optical flow 
estimation techniques designed for subpixel motion estima- 
tion were utilized in the computer simulations. The resulting 
high-resolution video stills computed using the various esti- 
mated motion fields were analyzed both visually and quanti- 
tatively to determine the most useful motion estimation 
technique. It was determined that the parametric motion 
model results in accurate multiframe video estimates when 
a global transformation such as a camera pan, rotation, tilt, 
or zoom occurs between frames. Optical flow equation- 
based techniques lend to perform well when objects move 
independently in front of a stationary .camera lens. Block 
matching can recover the motion well when the sequence 
contains a large number of spatial discontinuities and few 
flat image regions. All motion estimation techniques are in- 
herently imperfect, and it is easy to see the locations of inac- 
curate motion estimates within a high-resolution video still. 
An algorithm for the detection and elimination of inaccu- 
rate motion vectors was described, and simulations verified 
that the HRVS is improved when the inaccurate motion vec- 
tors were eliminated from the motion field. 

A number of issues will be explored in future research. 
First, the Horn-Schunck optical flow estimation method 
will be modified to allow discontinuities within the motion 
field by incorporating the Huber edge penalty function into 
the smoothness measure [20], The goal is to automatically 
segment the motion field into regions representing inde- 
pendent objects. Furthermore, the motion of each object 

motion estimate can be obtained by averaging over the 
motion field region where an individual object is undergo- 
ing a transformation between frames. Secondly, an itera- 
tive motion estimation approach will be implemented for 
the multiframe enhancement algorithm. Explicitly, multi- 
frame estimates will serve as the up-sampled frames, and 
then . the subpkd-resolution motion vector estimates will 

rates motion-compensated data subsampling and a point 
spread function (PSF) to represent blur due to an out-of- 
focus camera lens or object motion. 
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APPENDIX: SYMBOLS 



M 
z 

Z (A) 



number of rows within a digital video frame 
number of columns within a digital video frame 
number of frames within a digital image sequence 
lexicographically-ordered digital image, repre- 
sented as a vector 

frame k within a digital image sequence 
pixel at spatial location x ~ (*, , x 2 ) within the *** 
frame 

an estimate of the vector z 
Euclidean norm in real space 
Huber edge penalty function, dependent on pa- 
rameter a 

spatial activity measure computed at pixel z[ k) 
Huber edge penalty function threshold parameter 
Gibbs prior temperature parameter 
smoothing parameter 
mean or average value 
Gaussian noise standard deviation 
probability density function 
exponential function 

operator returning the least integral value greater 
than or equal to its argument 
operator returning the greatest integral value less 
than or equal to its argument 
constraint set 

integer-valued interpolation factor 
vector of useful observations, with reduced di- 
mensionality 

matrix of useful constraints, with reduced row 
dimensionality 

improved signal-to-noise ratio, in decibels. 

REFERENCES 

1. R. R. Schultz and R. L. Stevenson, Extraction of high-resoIuUon 
frames from video sequences. IEEE Trans. Image Process. 5f6) 
1996,996-1011. ' 

1 C A. Bcrenslein, L. N. Kanal, D. Lavine, and E. C Olson, A geomet- 
ric approach to subpixcl registration accuracy, Computer Vision, 
Graphics, and Image Processing, 40, 1987, 334-360. 

with 3-D recursive search block-matching. Signal Processing: Image 
Communications 6, 1994, 229-239. 

4. 0. Tian and M. N. Huhns, Algorithms for subpixel registration. Com- 
puter Vision Graphics, Image Processing 35, 1986, 220-233. 

5. R. Y. Tsai and T. S. Huang. Multiframe image restoration and regis- 
tration, in Advances in Computer Vision and Image Processing (R. Y. 
Tsai and T. S. Huang, Eds.). Vol. 1, pp. 317-339, JAI Press. Lon- 

...... ,dQAv.l9Wk ...„,,..,..„. ..... .,, .... ...... 

.-^A^Ielulp^JK-J^o^ 

reconslr*ictton. torn 4ower-xe^uuon-image .^eqiienees-and-. space- 
varying image restoration, in Proceedings, IEEE International Confer- 
ence on Acoustics, Speech, and Signal Processing, San Francisco CA. 
1992, pp. III-169 to IIT-172. 



*<•) 

a 

X 
M 

PrC) 

XI.: 

y 

A' 

4s/VK 



7. A. J. Path, M. I. Sczan, and A. M. Tckalp, High-resolution standards 
conversion of low resolution video, in Proceedings. IEEE Interna- 
lional Conference on Acoustics, Speech, and Signal Processing De- 
troit. MI, 1995, pp. 2197-2200. 

8. P. Cheesman, B. Kancfsky, R Kraft, J. Stutt. and R. Hanson, Super- 
Resolved Surface Reconstruction from Multiple Images, Tech. Rep. 
HA -94- 12, NASA Ames Research Center, Moffett Field. CA, 1994* 

9. R. R. Schultt and R. L. Stevenson, Motion-compensated scan conver- 
sion of interlaced video sequences, in Proceedings of the IS&T/SPIE 
Conference on Image and Video Processing IV, vol. 2666. (San Jose 
CA). pp, 107-U8. 1996. ' 

10. S. Mann and R. W. Picard, Virtual bellows: constructing high quality 
sulls from video, in Proceedings of the IEEE International Conference 
on Image Processing (Austin, TX), pp. 363-367. 1994. 

11. M. Bierling and R. Thorna, Motion compensating field interpolation 
usmg a hierarchically structured displacement estimator, Signal Pro- 
cessing 11(4), 1986, 387-404. * 

12. H. G. Musmarm, P. Pirscb. and H.-J. Grallert, Advances in picture 
coding, Proceedings of the IEEE 73(4), 1985, 523-548. 

13. B. K. P, Horn and B. G. Schunck, Determining optical flow, Artificial 
Intelligence 17,1981,185-203. 

14. A. K. Jain, Fundamentals of Digital Image Procezsing, Prenticc-HaU 
Englcwood Cliffs. NJ, 1989. 

15. H. H. Hou and H. C. Andrews, Cubic splines for image interpolation 
and digital filtering, IEEE Transactions on Acoustics, Speech and 
Signal Processing 26(6), 1978. 508-$17. 

16. R. R. Schultz and R. L. Stevenson' A Bayesian approach to image 
expansion for improved defimtion, IEEE Transactions on Image Pro- 
cessing 3(3), 1994,233-242. ■ * 

17. S. Gcman and D. Gcman, Stochastic relaxation, Gibbs distributions 
and the Bayesian restoration of images. IEEE Transactions on Partem 
Analysis and Machine Intelligence 6(6), 1984, 721-741. 

18. I M. Ortega and W. C Rheinbolt, Iterative Solutions of Nonlinear 
Equations m Several Variables. Computer Science and Applied Math- 
ematics, New York; Academic Press, 1970. 

19. A. M. Tekalp. Digital Video Processing. Upper Saddle River NJ* 
Prentice Hall.Inc, 1995. 

20. D Shulman and J.-Y. Hcrve\ Regulation of discontinuous flow 
fields, in Proceedings, Workshop on Visual Motion, Irvine, CA, 1989 
pp. 81-86. ' * 




RICHARD R. SCHULTZ was bom on March 19. 1967, in Grafton 
North Dakota. He received the B.S.EE degree (summa cum laude) from 
the University of North Dakota in 1990, and the M.S.EE and PhD de- 
grees from the University of Notre Dame in 1992 and 1995. respectively 
,Sr^v e Department of Electrical Engineering at the 
tmivertfty'ofNortBDakota fh 1995,SiWe^ 

• FacwMy Early earwri^velopiwmHeAREWJ^x^oiMegMieWs-llfjv • 
age Passing research and educational activities. Dr. Schultz is a member 
of the IEEE. SP1E, ASEE. Eta Kappa Nu. and Tau Beta Pi. His^™[ 
research mterests include digital signal, image, and video processing. 



This Page is Inserted by IFW Indexing and Scanning 
Operations and is not part of the Official Record 

BEST AVAILABLE IMAGES 

Defective images within this document are accurate representations of the original 
documents submitted by the applicant. 

Defects in the images include but are not limited to the items checked: 

SI BLACK BORDERS 

HI IMAGE CUT OFF AT TOP, BOTTOM OR SIDES 
HI FADED TEXT OR DRAWING 

□ BLURRED OR ILLEGIBLE TEXT OR DRAWING 

□ SKEWED/SLANTED IMAGES 

□ COLOR OR BLACK AND WHITE PHOTOGRAPHS 
HI GRAY SCALE DOCUMENTS 

HI LINES OR MARKS ON ORIGINAL DOCUMENT 

□ REFERENCE(S) OR EXHIBIT(S) SUBMITTED ARE POOR QUALITY 

□ OTHER: 

IMAGES ARE BEST AVAILABLE COPY. 
As rescanning these documents will not correct the image 
problems checked, please do not report these problems to 
the IFW Image Problem Mailbox. 



50 



SQIULTZ, MENG, AND STEVENSON 



LI MENG was born on September 23, 1971, in Beijing, China. She 
received the US degree in electrical engineering from Beijing University 
in 1995, and the M.S.E.E. degree from the University of North Dakota 
in 1996. Ms. Mcng joined LSI Logic Corporation (Milpita*. CA) in 1997. 
She conducts research and development within the Digital video Prod- 
uct* group. 




ROBERT L STEVENSON received the B.E.E. degree (summa 



cum laudc) from the University of Delaware in 1986, and the Ph.D. 
in electrical engineering from Purdue University in 1990, While at 
Purdue, he was supported by graduate fellowships from the National 
Science Foundation, DuPom Corporation. Phi Kappa Phi, and Purdue 
University. He joined the faculty of the Department of Electrical 
Engineering at the University of Notre Dame in 1990, where he is 
currently an Associate Professor. His research interests include image/ 
video processing, image/ video compression* robust image /video commu- 
nication systems, multimedia systems, ill-posed problems in computa- 
tional vision, and computational issues in image processing. Dr. Steven- 
son currently serves as an Associate Editor for the IEEE Transactions 
on Circuits and Systems for Video Technology. He is the General 
Chair of the 1998 Midwest Symposium on Circuits and Systems to be 
held on the campus of the University of Notre Dame, and he will 
co-chair the 1999 Conference on Visual Communications and Image 
Processing. 



Presented at the SPIE Conference on Applications of Digital Image Processing XXVII Paper No. 5558-53 

Special Session on Advances in the New Emerging Standard: H.264/A VC, August, 2004 

The H.264/AVC Advanced Video Coding Standard: 
Overview and Introduction to the Fidelity Range Extensions 

Gary J. Sullivan*, Pankaj Topiwala*, and Ajay Luthra* 

'Microsoft Corporation, One Microsoft Way, Redmond, WA 98052 
TastVDO LLC, 7150 Riverwood Dr., Columbia, MD 21046 
^Motorola Inc., BCS, 6420 Sequence Dr., San Diego, CA 92121 



ABSTRACT 

H.264/MPEG-4 AVC is the latest international video coding standard. It was jointly developed by the Video Coding 
Experts Group (VCEG) of the ITU-T and the Moving Picture Experts Group (MPEG) of ISO/IEC. It uses state-of-the-art 
coding tools and provides enhanced coding efficiency for a wide range of applications including video telephony, video 
conferencing, TV, storage (DVD and/or hard disk based, especially high-definition DVD), streaming video, digital video 
authoring, digital cinema, and many others. The work on a new set of extensions to this standard has recently been 
completed. These extensions, known as the Fidelity Range Extensions (FRExt), provide a number of enhanced 
capabilities relative to the base specification as approved in the Spring of 2003. In this paper, an overview of this 
standard is provided, including the highlights of the capabilities of the new FRExt features. Some comparisons with the 
existing standards, MPEG-2 and MPEG-4 Part 2, are also provided. 

Keywords: Advanced Video Coding (AVC), Digital Video Compression, H.263, H.264, JVT, MPEG, MPEG-2 
MPEG-4, MPEG-4 part 10, VCEG. 



L INTRODUCTION 

Since the early 1990s, when the technology was in its infancy, international video coding standards - chronologically 
H.261 [1], MPEG-1 [2], MPEG-2 / H.262 [3], H.263 [4], and MPEG-4 (Part 2) [5] - have been the engines behind the 
commercial success of digital video compression. They have played pivotal roles in spreading the technology by 
providing the power of interoperability among products developed by different manufacturers, while at the same time 
allowing enough flexibility for ingenuity in optimizing and molding the technology to fit a given application and making 
the cost-performance trade-offs best suited to particular requirements. They have provided much-needed assurance to 
the content creators that their content will run eveiywhere and they do not have to create and manage multiple copies of 
the same content to match the products of different manufacturers. They have allowed the economy of scale to allow 
steep reduction in cost for the masses to be able to afford the technology. They have nurtured open interactions among 
experts from different companies to promote innovation and to keep pace with the implementation technology and the 
needs of the applications. 

ITU-T H.264 / MPEG-4 (Part 10) Advanced Video Coding (commonly referred as H.264/AVC) [6] is the newest entry 
in the series of international video coding standards. It is currently the most powerful and state-of-the-art standard, and 
was developed by a Joint Video Team (JVT) consisting of experts from ITU-T's Video Coding Experts Group (VCEG) 
and ISO/IEC's Moving Picture Experts Group (MPEG). As has been the case with past standards, its design provides 
the most current balance between the coding efficiency, implementation complexity, and cost - based on state of VLSI 
design technology (CPU's, DSP's, ASIC's, FPGA's, etc.). In the process, a standard was created that improved coding 
efficiency by a factor of at least about two (on average) over MPEG-2 - the most widely used video coding standard 
today - while keeping the cost within an acceptable range. In July, 2004, a new amendment was added to this standard, 
called the Fidelity Range Extensions (FRExt, Amendment 1), which demonstrates even further coding efficiency against 
MPEG-2, potentially by as much as 3:1 for some key applications. In this paper, we develop an outline of the first 
version of the H.264/AVC standard, and provide an introduction to the newly-minted extension, which, for reasons we 
explain, is already receiving wide attention in the industry. 



1.1. H.264/AVC History 

H.264/AVC was developed over a period of about four years. The roots of this standard lie in the ITU-T's H 26L 
project initiated by the Video Coding Experts Group (VCEG), which issued a Call for Proposals (CfP) in early 1998 and 
created a first draft design for its new standard in August of 1 999. In 2001, when ISO/IECs Moving Pictures Experts 
Group (MPEG) had finished development of its most recent video coding standard, known as MPEG-4 Part 2 it issued a 
similar CfP to invite new contributions to farther improve the coding efficiency beyond what was achieved on that 
project. VCEG chose to provide its draft design in response to MPEG's CfP and proposed joining forces to complete the 
work. Several other proposals were also submitted and were tested by MPEG as well. As a result of those tests, MPEG 
made the following conclusions that affirmed the design choices made by VCEG for H.26L: 

♦ Motion compensated Discrete Cosine Transform (DCT) structure was superior to others, implying there was no 
need, at least at that stage, to make fundamental structural changes for the next generation of coding standard. 

♦ Some video coding tools that have been excluded in the past (for MPEG-2, H.263, or MPEG-4 Part 2) due to their 
complexity (hence implementation cost) could be re-examined for inclusion in the next standard. The VLSI 
technology had advanced significantly since the development of those standards in the past and it has reduced 
significantly the implementation cost of those coding tools. (This was not a "blank check" for compression at all 
costs, as a number of compromises were still necessary for complexity reasons, but it was a recognition that some of 
the complexity constraints that governed past work could be re-examined.) 

♦ To allow maximum freedom of improving the coding efficiency, the syntax of the new coding standard could not be 
backward compatible with prior standards. 

♦ ITU-T's H.26L was a top-performing proposal, and most others that showed good performance in MPEG had also 
been based on H.26L (as it had become well-known as an advance in technology by that time). 

Therefore, to allow speedy progress, ITU-T and ISO/IEC agreed to join forces together to jointly develop the next 
generation of video coding standard and use H.26L as the starting point. A Joint Video Team (JVT), consisting of 
experts from VCEG and MPEG, was formed in December, 2001, with the goal of completing the technical development 
of the standard by 2003. ITU-T planned to adopt the standard under the name of ITU-T H.264, and ISO/IEC planned to 
adopt the standard as MPEG-4 Part 10 Advanced Video Coding (AVC), in the MPEG-4 suite of standards formally 
designated as ISO/IEC 14496. As an unwanted byproduct, this standard gets referred to by at least six different names - 
H.264, H.26L, ISO/IEC 14496-10, JVT, MPEG-4 AVC and MPEG-4 Part 10. In this paper we refer it as H.264/AVC as 
a balance between the names used in the two organizations. 

With the wide breadth of applications considered by the two organizations, the application focus for the work was 
correspondingly broad - from video conferencing to entertainment (broadcasting over cable, satellite, terrestrial, cable 
modem, DSL etc.; storage on DVDs and hard disks; video on demand etc.) to streaming video, surveillance and military 
applications, and digital cinema. Three basic feature sets called profiles were established to address these application 
domains: the Baseline, Main, and Extended profiles. The Baseline profile was designed to minimize complexity and 
provide high robustness and flexibility for use over a broad range of network environments and conditions; the Main 
profile was designed with an emphasis on compression coding efficiency capability; and the Extended profile was 
designed to combine the robustness of the Basline profile with a higher degree of coding efficiency and greater network 
robustness and to add enhanced modes useful for special "trick uses" for such applications as flexible video streaming. 

1.2. The FRExt Amendment 

While having a broad range of applications, the initial H.264/AVC standard (as it was completed in May of 2003), was 
primarily focused on "entertainment-quality" video, based on 8 bits/sample, and 4:2:0 chroma sampling. Given its time 
constraints, it did not include support for use in the most demanding professional environments, and the design had not 
been focused on the highest video resolutions. For applications such as content-contribution, content-distribution, and 
studio editing and post-processing, it may be necessary to 

♦ Use more than 8 bits per sample of source video accuracy 

♦ Use higher resolution for color representation than what is typical in consumer applications (i.e., to use 4:2:2 or 
4:4:4 sampling as opposed to 4:2:0 chroma sampling format) 

♦ Perform source editing functions such as alpha blending (a process for blending of multiple video scenes, best 
known for use in weather reporting where it is used to super-impose video of a newscaster over video of a map or 
weather-radar scene) 



♦ Use very high bit rates 

♦ Use very high resolution 

♦ Achieve very high fidelity - even representing some parts of the video lossiessly 

♦ Avoid color-space transformation rounding error 

♦ Use RGB color representation 

To address the needs of these most-demanding applications, a continuation of the joint project was launched to add new 
extensions to the capabilities of the original standard. This effort took about one year to complete - starting with a first 
draft in May of 2003, the final specification was completed in July of 2004, and the editing period was completed in 
August of 2004. These extensions, originally known as the "professional" extensions, were eventually renamed as the 
"fidelity range extensions" (FRExt) to better indicate the spirit of the extensions. 

In the process of designing the FRExt amendment, the JVT was able to go back and re-examine several prior technical 
proposals that had not been included in the initial standard due to scheduling constraints, uncertainty about benefits, or 
the original scope of intended applications. With the additional time afforded by the extension project, it was possible to 
include some of those features in the new extensions. Specifically, these included: 

♦ Supporting an adaptive block-size for the residual spatial frequency transform, 

♦ Supporting encoder-specified perceptual-based quantization scaling matrices, and 

♦ Supporting efficient lossless representation of specific regions in video content. 

The FRExt project produced a suite of four new profiles collectively called the High profiles: 

♦ The High profile (HP), supporting 8-bit video with 4:2:0 sampling, addressing high-end consumer use and other 
applications using high-resolution video without a need for extended chroma formats or extended sample accuracy 

♦ The High 10 profile (HilOP), supporting 4:2:0 video with up to 10 bits of representation accuracy per sample 

♦ The High 4:2:2 profile (H422P), supporting up to 4:2:2 chroma sampling and up to 10 bits per sample, and 

♦ The High 4:4:4 profile (H444P), supporting up to 4:4:4 chroma sampling,up to 12 bits per sample, and additionally 
supporting efficient lossless region coding and an integer residual color transform for coding RGB video while 
avoiding color-space transformation error 

All of these profiles support all features of the prior Main profile, and additionally support an adaptive transform block- 
size and perceptual quantization scaling matrices. 

Initial industry feedback has been dramatic in its rapid embrace of FRExt. The High profile appears certain to be 
incorporated into several important near-term application specifications, particularly including 

♦ The HD-DVD specification of the DVD Forum 

♦ The BD-ROM Video specification of the Blu-ray Disc Association, and 

♦ The DVB (digital video broadcast) standards for European broadcast television 

Several other environments may soon embrace it as well (e.g., the Advanced Television Systems Committee (ATSC) in 
the U.S., and various designs for satellite and cable television). Indeed, it appears that the High profile may rapidly 
overtake the Main profile in terms of dominant near-term industry implementation interest. This is because the High 
profile adds more coding efficiency to what was previously defined in the Main profile, without adding a significant 
amount of implementation complexity. 



2. CODING TOOLS 

At a basic overview level, the coding structure of this standard is similar to that of all prior major digital video standards 
(H.26I, MPEG-1, MPEG-2 / H.262, H.263 or MPEG-4 part 2). The architecture and the core building blocks of the 
encoder are shown in Fig. 1 and Fig. 2, indicating that it is also based on motion-compensated DCT-like transform 
coding. Each picture is compressed by partitioning it as one or more slices; each slice consists of macroblocks, which 
are blocks of 16x16 luma samples with corresponding chroma samples. However, each macroblock is also divided into 
sub-macroblock partitions for motion-compensated prediction. The prediction partitions can have seven different sizes - 
16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4. In past standards, motion compensation used entire macroblocks or, in the 



case of newer designs, 16x16 or 8x8 partitions, so the larger variety of partition shapes provides enhanced prediction 
accuracy. The spatial transform for the residual data is then either 8x8 (a size supported only in FRExt) or 4x4. In past 
major standards, the transform block size has always been 8x8, so the 4x4 block size provides an enhanced specificity in 
locating residual difference signals. The block size used for the spatial transform is always either the same or smaller 
than the block size used for prediction. The hierarchy of a video sequence, from sequence, to samples is given by; 

sequence (pictures (slices (macroblocks (macroblock partitions (sub-macroblock partitions (blocks (samples)))))). 

In addition, there may be additional structures such as packetization schemes, channel codes, etc, which relate to the 
delivery of the video data, not to mention other data streams such as audio. As the video compression tools primarily 
work at or below the slice layer, bits associated with the slice layer and below are identified as Video Coding Layer 
(VCL) and bits associated with higher layers are identified as Network Abstraction Layer (NAL) data. VCL data and the 
highest levels of NAL data can be sent together as part of one single bit stream or can be sent separately. The NAL is 
designed to fit a variety of delivery frameworks (e.g., broadcast, wireless, on media). Herein, we only discuss the VCL, 
which is the heart of the compression capability. While an encoder block diagram is shown in Fig. 1, the decoder 
conceptually works in reverse, comprising primarily an entropy decoder and the processing elements of the region 
shaded in Fig. 1. e 




Fig. 1: High-level encoder architecture 
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Fig. 2: Higher-level encoder block diagram 

In the first version of the standard, only the 4:2:0 chroma format (typically derived by performing an RGB-to-YCbCr 
color-space transformation and subsampling the chroma components by a factor of 2:1 both horizontally and vertically) 
and only 8 bit resolution for luma and chroma values was supported. The FRExt amendment extended the standard to 
4:2:2 and 4:4:4 chroma formats and higher than 8 bits resolution, with optional support of auxiliary pictures for such 
purposes as alpha blending composition. 

The basic unit of the encoding or decoding process is the macroblock. In 4:2:0 chroma format, each macroblock consists 
of a 16x16 region of luma samples and two corresponding 8x8 chroma sample arrays. In a macroblock of 4:2:2 chroma 
format video, the chroma sample arrays are 8x16 in size; and in a macroblock of 4:4:4 chroma format video, they are 
16x16 in size. 

Slices in a picture are compressed by using the following coding tools: 

♦ "Intra" spatial (block based) prediction 

o Full-macroblock luma or chroma prediction - 4 modes (directions) for prediction 
o 8x8 (FRExt-only) or 4x4 luma prediction - 9 modes (directions) for prediction 

♦ "Inter" temporal prediction - block based motion estimation and compensation 
o Multiple reference pictures 

Reference B pictures 
Arbitrary referencing order 
Variable block sizes for motion compensation 

■ Seven block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 
l/4"sample luma interpolation (1/4 or l/8th-sample chroma interpolation) 
Weighted prediction 

Frame or Field based motion estimation for interlaced scanned video 
Interlaced coding features 
o Frame-field adaptation 

■ Picture Adaptive Frame Field (PicAFF) 

■ MacroBlock Adaptive Frame Field (MBAFF) 
o Field scan 

Lossless representation capability 
o Intra PCM raw sample- value macrob locks 

o Entropy-coded transform-bypass lossless macroblocks (FRExt-only) 

8x8 (FRExt-only) or 4x4 Integer Inverse Transform (conceptually similar to the well-known DCT) 
Residual color transform for efficient RGB coding without conversion loss or bit expansion (FRExt-only) 
Scalar quantization 

Encoder-specified perceptually weighted quantization scaling matrices (FRExt-only) 
Logarithmic control of quantization step size as a function of quantization control parameter 
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♦ 



Deblocking filter (within the motion compensation loop) 

♦ Coefficient scanning 
o Zig-Zag (Frame) 
o Field 

♦ Lossless Entropy coding 

o Universal Variable Length Coding (UVLC) using Exp-Golomb codes 
o Context Adaptive VLC (CAVLC) 

o Context-based Adaptive Binary Arithmetic Coding (CABAC) 

♦ Error Resilience Tools 

o Flexible Macroblock Ordering (FMO) 
o Arbitrary Slice Order (ASO) 
o Redundant Slices 

♦ SP and SI synchronization pictures for streaming and other uses 

♦ Various color spaces supported (YCbCr of various types, YCgCo, RGB, etc. - especially in FRExt) 

♦ 4:2:0, 4:2:2 (FRExt-only), and 4:4:4 (FRExt-only) color formats 

♦ Auxiliary pictures for alpha blending (FRExt-only) 

Of course each slice need not use all of the above coding tools. Depending upon on the subset of coding tools used a 
slice can be of I (Intra), P (Predicted), B (Bi-predicted), SP (Switching P) or SI (Switching I) type. B-slices come in *vo 
flavors - reference or non-reference. Reference B-slices can be used as reference for temporal prediction. A picture 
may contain different slice types. In the next section we describe the coding tools used for these different slice types. 

This standard is designed to perform well for both progressive-scan and interiaced-scan video. In interlaced-scan video, 
a frame consists of two fields - each captured at !4 the frame duration apart in time. Because the fields are captured with 
significant time gap, the spatial correlation among adjacent lines of a frame is reduced in the parts of picture containing 
moving objects. Therefore, from coding efficiency point of view, a decision needs to be made whether to compress 
video as one single frame or as two separate fields. H.264/AVC allows that decision to be made either independently for 
every two-vertical-macroblock pair or independently for each entire frame. When the decisions are made at the 
macroblock-pair level, this is called MacroBlock Adaptive Frame-Field (MBAFF) coding and when the decisions are 
made at the frame level then this is called Picture-Adaptive Frame-Field (PicAFF) coding. Notice that in MBAFF, 
unlike in the MPEG-2 standard, the frame or field decision is made for the vertical macroblock-pair and not for each 
individual macroblock. This allows retaining a 16x16 sire for each macroblock and the same size for all sub- 
macroblock partitions - regardless of whether the macroblock is processed in frame or field mode and regardless of 
whether the mode switching is at the picture level or the macroblock-pair level. 

2.1. i-slice 

In I-slices (and in intra macroblocks of non-I slices) pixel values are first spatially predicted from their neighboring pixel 
values. After spatial prediction, the residual information is transformed using a 4x4 transform or an 8x8 transform 
(FRExt-only) and then quantized. In FRExt, the quantization process supports encoder-specified perceptual-based 
quantization scaling matrices to optimize the quantization process according to the visibility of the specific frequency 
associated with the transform coefficient. Quantized coefficients of the transform are scanned in one of the two different 
ways (zig-zag or field scan) and are compressed by entropy coding using one of two methods - CAVLC or CABAC. In 
PicAFF operation, each field is compressed in a manner analogous to the processing of an entire frame. In MBAFF 
operation, if a macroblock pair is in field mode then the field neighbors are used for spatial prediction and if a 
macroblock pair is in frame mode, frame neighbors are used for prediction. The frame or field decision is made before 
applying rest of the coding tools described below. Temporal prediction is not used in intra macroblocks, but it is for P 
and B macroblock types, which is the main difference between the slice types. We therefore review the structure of the 
codec for the I-slice first, and then review the key differences for P and B-slices later. 
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2.1.1. Spatial Prediction 

To exploit spatial correlation among pixels, spatial prediction modes are defined: 

♦ Full-macroblock prediction for 1 6x 16 luma or the corresponding chroma block size, or 

♦ 8x8 luma prediction (FRExt-only), or 

♦ 4x4 luma prediction. 

For luma, the full-macroblock prediction is 16x16 in size. 

In 16x16 spatial prediction mode, the luma values of an entire 16x16 macroblock are predicted from the pixels around 
0>e edges as shown in the Fig. 3. Prediction can be made in one of the four different ways: (i) vertical, (ii) horizontal, 
in) DC and (iv) planar In the vertical and horizontal predictions the luma values of a macroblock are predicted from 
the pixels just above or left of the macroblock, respectively. In DC prediction, the luma values of the neighboring pixels 
are averaged and that average value is used as predictor. In the planar prediction, it is assumed that the macroblock 
covers diagonally increasing luma values and the predictor is formed based upon the planar equation. 

In 4x4 spatial prediction mode, the luma values of 4x4 blocks are predicted from the neighboring pixels above or left of 
the block and 9 different directional ways of prediction are allowed (see Fig. 3). Each prediction direction specifies a 
particular set of spatially-dependent linear combinations of previously decoded samples for use as the prediction of each 
input sample. 

8x8 luma intra prediction (available only in FRExt profiles) uses basically the same concept as 4x4 prediction, but with a 
block size that is 8x8 rather than 4x4 and with low-pass filtering of the predictor to improve prediction performance. 
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Fig. 3: Spatial prediction of a 4x4 block. 



Chroma always operates using ruli-macroblock prediction; Because of differences in the size of the chroma arrays for 

a^T u m different Chroma f0rmatS (U " 8x8 chroma h 4:2:0 "^blocks, 8x16 chroma in 4:2:2 macroblocks, 
and 16x16 chroma in 4:4:4 macroblocks), chroma prediction is defined for three possible block sizes. In all of these 
cases the chroma blocks are predicted in a manner similar to the 16x16 luma macroblock prediction: (i) DC 
(ii) Horizontal, (iii) Vertical and (iv) planar. 
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2.1.2. Transform and Quantization 

t nC L SP f a ' pK ? M ° n : a «n«sforrn is applied to decollate the data spatially. There are several unique features about 
tne transform selected for this coding standard. Some of these features are listed below. 

(1) It is the first video standard fundamentally based on an integer inverse transform design for its main spatial 
transforms, rather than using a floating point inverse transform. The forward transform that will typically be 
used for encoding is also an integer transform. MPEG-4 part 2 and JPEG2000 had previously included integer 
wa ^5 a " sfor ? s - But JPEG2000 is an image coding standard without support for interframe prediction, and 
in MPEG-4, the integer transforms are used only rarely for what is called texture coding (somewhat equivalent 
to the usual I-frame coding, but not found in most implementations of MPEG-4), and the main transform used 
for nearly all video data was still specified as a floating point 8x8 IDCT. A significant advantage of this fact is 
that with an exact integer inverse transform, there is now no possibility of a mismatch between then encoder 
and decoder, unlike for MPEG-2 and ordinary MPEG-4 part 2. (The integer transform concept had also been 
previously applied in H.263 Annex W, but only as an after-the-fact patch to a prior specification in terms of the 
8x8 floating point IDCT.) 

In fact, the transform is specified so that for 8-bh input video data, it can be easily implemented using only 16- 
bit anthmetic, rather than the 32-bit or greater precision needed for the transform specified in prior standards 
The transform (at least for the 4x4 block size supported without FRExt) is designed to be so simple that it can 
be implemented using just a few additions, subtractions, and bit shifts. 
(4) A 4x4 transform size is supported, rather than just 8x8. Inconsistencies between neighboring blocks occur at a 
smaller granularity, and thus tend to be less noticeable. Isolated features can be represented with greater 
accuracy in spatial location (reducing a phenomenon known as "ringing"). For certain hardware 
implementations, the small block size may also be particularly convenient. 

Thus while the macroblock size remains at 16x16, these are divided up into 4x4 or 8x8 blocks, and a 4x4 or 8x8 block 
transform T is applied to every block of pixels. For the 4x4 block size, the transform is specified as follows (other 
transforms are also defined here for use below): 
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For the 8x8 block size (when selected, and applied only to luma and only with FRExt extensions), a somewhat more 
complex (but still remarkably simple when compared to an ordinary 8x8 IDCT) transformation matrix is used: 
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The transform Txs applied to the luma (16x16) and chroma (8x8, or in FRExt, 8x16 or 16x16) samples for a macroblock 
by segmenting the full sample block size into smaller blocks for transformation as necessary. In addition, when the 
16x16 Intra prediction mode is used with the 4x4 transform, the DC coefficients of the 16 4x4 blocks in that macroblock 
are further selected and transformed using the Hadamard transform H (note the basic similarity of T and H) Meanwhile 
at the general 4x4 level, the corresponding 2x2 chroma samples for 4:2:0 are transformed according to the matrix C 



-8- 



>ve a 2x2 Hadamard transform. For 4:2:2 and 4:4:4 chroma formats, the Hadamard block size is increased to reflect 
enlarged block shape. 



The coefficients after the transformation are quantized using a quantization parameter that can be changed for every 
macroblock. The parameter can take one of the 52 possible values when video format supports 8 bits per decoded 
sample. When supporting greater bit depth video content, FRExt expands the number of steps by 6 for each additional 
bit of decoded sample accuracy. Importantly, the quantization step-sizes are not linearly related to the quantization 
parameter (as in all prior standards), but vary in such a way that the quantization step size exactly doubles for every 6 
increments of the quantization parameter. 

A default relationship is specified between the quantization step sizes used for luma and chroma, and the encoder can 
adjust this relationship at the slice level to balance the desired fidelity of the color components. With the FRExt 
amendment, the encoder can also balance the Cb and Cr fidelity separately relative to each other. 

2.13. Perceptual-based quantization scaling matrices 

The new FRExt amendment adds support for a feature that had been a mainstay of prior use in MPEG-2 - namely, 
perceptual-based quantization scaling matrices. The encoder can specify, for each transform block size and separately 
for intra and inter prediction, a customized scaling factor for use in inverse-quantization scaling by the decoder. This 
allows tuning of the quantization fidelity according to a model of the sensitivity of the human visual system to different 
types of error. It typically does not improve objective fidelity as measured by mean-squared error (or, equivalently, 
PSNR), but it does improve subjective fidelity, which is really the more important criterion. Default values for the ' 
quantization scaling matrices are specified in the standard, and the encoder can choose to instead use customized values 
by sending a representation of those values at the sequence or picture level. 

2.1.4. Scanning 

If a macroblock is compressed in the frame mode then the quantized coefficients of the transform are scanned in the zig- 
zag fashion as shown in Fig. 4a for the 4x4 block size. This scan is designed to order the highest-variance coefficients 
first and to maximize the number of consecutive zero-valued coefficients appearing in the scan. 




a b 
Fig. 4: Coefficient scanning order in (a) Frame and (b) Field modes 

If a macroblock is compressed in the field mode then the scanning order of the coefficients is modified to be more 
efficient for field scanning as shown in Fig. 4b - reflecting the decreased correlation of the source data in the vertical 
dimension. For other block sizes such as 8x8, the same concepts apply, with similar scans specified for each block size. 

2.1.5. Entropy coding 

Entropy coding is a lossless coding technique that replaces data elements with coded representations which, in 
combination with the previously-described predictions and transformations, can result in significantly reduced data size 
(even though on its own, it can only reduce the data size modestly). Two modes of entropy coding are used in this 
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standard: variable length coding (VLC), a type of Huffman coding, and binary arithmetic coding (BAC). In the current 
structure, both of these designs are context adaptive (CA), leading to CAVLC and CABAC. Syntax elements at and 
below the slice layer can be adaptively coded. 

Like previous standards, the macroblock is the fundamental unit of the syntax. Data elements to be encoded include: 

(a) higher syntax elements for sequence and picture, which are coded using special fixed VLC codes; 

(b) slice layer (at and below which layer the entropy codes CAVLC or CABAC are used); 

(c) macroblock type (mbjype); 

(d) coded block pattern (CBP), which indicates which sets of 4x4 blocks have non-zero elements (thus efficiently 
indicating which subblocks not to code); 

(e) quantization parameter, coded as a delta from the previous macroblock; 

(f) reference frame index; 

(g) motion vector (MV), sent as a delta from previous MV; and finally 

(h) the quantized transform coefficients (from 8x8 or 4x4 transformations, or from secondary Hadamard 
transformations applied to DC coefficients of lower-level 4x4 transformations). 

The final two types of data comprise the bulk of the coded data. At very low bitrates, (g) may be dominant; at all higher 
rates, (h) is dominant, and it is in encoding the transform coefficients that context adaptivity is primarily employed. 



2.1.5.1. CAVLC 

The principle idea of VLC is that the data elements to be coded (whether quantized transform coefficients, differential 
motion vectors, or other syntactic symbols) occur with unequal frequencies; frequently-occurring elements are assigned 
short codes, while infrequent elements are assigned long codes (thus variable length coding - often called Huffman 
coding as Huffman codes are the most well-known type of VLCs). For syntax elements other than residual transform 
coefficients, a fixed VLC is used. Table 1 shows the first nine elements of the Exp-Golomb Code table for given input 
data elements (here called codeNum). The codeNum is typically an index to the actual data elements (e.g., signed 
elements such as motion vector differences are mapped by [0, 1, -1, 2, ...] to [0, 1, 2, 3, ,..]). These codes have the 
generic form of [K zeros][l][K-bit DATA], where DATA is a binary representation of an unsigned integer. Such a code 
is decodable as codeNum = 2 K + int(DATA) -1, where int(DATA) is now the integer corresponding to the binary string. 
While only the first nine elements are shown, the exp-Golomb code table is conceptually infinite in length. The actual 
limits on the values of the coded syntax elements are specified by various constraints imposed in the standard. 

Table 1: Exponential Golomb Codes (for data elements other than transform coefficients - these codes 
are actually fixed, and are also called Universal Variable Length Codes (UVLC) [7]) 





code 


I 0 


1 


1 


010 


2 


011 


3 


00100 I 


4 


00101 


! 5 


00110 


I 6 


00111 ! 


7 


0001000 


8 


0001001 | 



But for greater efficiency in coding the abundant residual transform coefficients, this standard maintains 1 1 different sets 
of codes (4 for the number of coefficients, and 7 for the actual coefficients), which are adapted to the statistics of the 
current stream or context (thus CAVLC). The latter seven tables are geared towards the lower to higher levels 
consecutively; coding is typically initialized to lower tables, and incremented up depending on the size of the levels 
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coded. Given the execution efficiency of VLC tables, combined with this limited adaptivity (which thus permits 
parallelization by macroblocks), this provides a nice tradeoff between speed of execution and performance. 

2.1.5.2. CABAC 

The use of context-based adaptive binary arithmetic coding (CABAC) has been adopted into the standard as a way of 
gaining additional performance relative to CAVLC coding, at the cost of additional complexity. The CABAC mode has 
been shown to increase compression efficiency by roughly 10% relative to the CAVLC mode, although CABAC is 
significantly more computationally complex. Here the use of arithmetic coding permits non-integer number of bits per 
symbol, adaptivity allows the coder to adjust to changing symbol statistics, and the context modeling improves 
prediction performance. CABAC is used for encoding a broader range of syntax elements than CAVLC, starting at the 
Slice layer (while higher-layer syntax elements are still codec with fixed codes such as Exp-Golomb). In particular, 
motion vectors as well as residual transform coefficients are codec with CABAC, leading to tighter encoding, whereas 
with CAVLC only the coefficients are. However, the price to pay is that given the broader applicability of CABAC, 
combined with its various contexts (e.g., 257 of them in the orginal version of the standard, with a few dozen more 
added in FRExt), it is basically a serial engine that can be very compute intensive, especially for high pixel and data 
rates. 




Fig. 5: Generic block diagram of the CABAC entropy coding scheme. 

The steps in the CABAC entropy coding scheme are depicted in Fig. 5. Suppose a symbol for an arbitrary syntax 
element is given. In a first step, a suitable model is chosen according to a set of past observations of relevant syntax 
elements; this is called context modeling. Different models are maintained for each syntax element (e.g., motion vectors 
and transform coefficients have different models). If a given symbol is non-binary valued, it will be mapped onto a 
sequence of binary decisions, so-called bins, in a second step. The actual binarization is done according to a given 
binary tree - and in this case the CAVLC binary tree is used. Finally, each binary decision is encoded with the adaptive 
binary arithmetic coding (BAC) engine using the probability estimates, which have been provided by the context 
modeling stage. The provided models serve as a probability estimation of the related bins. After encoding of each bin, 
the related mode! probability estimate is updated to adjust upward the probability estimate for the binary symbol that 
was encoded. Hence, the model keeps track of the actual statistics. 

Starting with each frame, the probability models associated with all 257 different contexts used in H.254/AVC are 
initialized with a pre-computed initial distribution. For each symbol encoded, the frequency count of the related binary 
decision is updated, thus providing a new probability estimate for the next coding decision. However, when the total 
number of occurrences of a given model exceeds a pre-defined threshold, the frequency counts are scaled down. This 
periodical rescaling exponentially weighs down past observations and helps to adapt to the non-stationarity of a source. 
The context modeling used here is innovative, and is described in detail in the standard itself [6]. The arithmetic coding 
is largely based on the well-known techniques developed in [8]. 



2. 1 .6. Lossless macroblock modes 

When the fidelity of the coded video is high (i.e., when the quantization step size is very small), it is possible in certain 
very rare instances of input picture content for the encoding process to actually cause data expansion rather than 
compression. Furthermore, it is convenient for implementation reasons to have a reasonably- low identifiable limit on 
the number of bits necessary to process in a decoder in order to decode a single macroblock. To address these issues, the 



standard includes a "PCM" macrobiock mode, in which the values of the samples are sent directly - without prediction, 
transformation, or quantization. An additional motivation for support of this macrobiock mode is to allow regions of the 
picture to be represented without any loss of fidelity. 

However, the PCM mode is clearly not efficient - indeed it is not intended to be efficient - rather, it is intended to be 
simple and to impose a low bound on the number of bits that can be used to represent a macrobiock with sufficient 
accuracy. If one considers the bits necessary to indicate which mode has been selected for the macrobiock, the use of the 
PCM mode actually results in a minor degree of data expansion. When developing the FRExt amendment, it was 
decided that a more effective means of lossless coding was desirable for the most demanding applications. FRExt 
therefore also includes a transform-bypass lossless mode which uses prediction and entropy coding for encoding sample 
values. When this mode is enabled (which can only be in Hi444P use), the meaning of the smallest selectable value of 
the quantization parameter is redefined to invoke the lossless coding operation. The new lossless mode of FRExt is a 
fairly efficient lossless video coder - although not the best method for lossless still-picture (intra) coding, it stands out as 
extremely efficient when used together with inter-picture prediction (as described in the following sections). 

2.2. P-slices 

In P-slices (predictively-coded, or "inter" slices), temporal (rather than spatial) prediction is used, by estimating motion 
among pictures. Innovatively, motion can be estimated at the 1 6x16 macrobiock level or by partitioning the macrobiock 
into smaller regions of luma size 16x8, 8x16, 8x8, 8x4, 4x8, 4x4 (see Fig. 6). A distinction is made between a 
macrobiock partition, which corresponds to a luma region of size 16x16, 16x8, 8x16, or 8x8, and sub-macroblock 
partition, which is a region of size 8x8, 8x4, 4x8, or 4x4. When (and only when) the macrobiock partition size is 8x8, 
each macrobiock partition can be divided into sub-macroblock partitions. For example, it is possible within a single 
macrobiock to have both 8x8 and 4x8 partitionings, but not 16x8 and 4x8 partitionings. Thus the first row of Fig. 6 
shows the allowed macrobiock partitions, and the sub-macroblock partitions shown in the second row can be selected 
independently for each 8x8 region, but only when the macrobiock partition size is 8x8 (the last partitioning shown in the 
first row). 

A distinct motion vector can be sent for each sub-macroblock partition. The motion can be estimated from multiple 
pictures that lie either in the past or in the future in display order. The selection of which reference picture is used is 
done on the macrobiock partition level (so different sub-macroblock partitions within the same macrobiock partition 
must use the same reference picture). The limit on number of pictures used for the motion estimation is specified in the 
Levels (described below). To estimate the motion, pixel values are first interpolated to achieve quarter-pixel accuracy 
for luma and up to l/8th pixel accuracy for chroma. Interpolation of luma is performed in two steps - half-pixel and 
then quarter-pixel interpolation. Half-pixel values are created by filtering with the kernel [1 -5 20 20 -5 l]/32, 
horizontally and/or vertically. Quarter-pixel interpolation is performed by averaging two nearby values (horizontally, 
vertically, or diagonally) of half pixel accuracy. After interpolation, block-based motion compensation is applied. As 
noted however, a variety of block sizes can be considered, and a motion estimation scheme that optimizes the trade-off 
between the number of bits necessary to represent the video and the fidelity of the result is desirable. 

If a macrobiock has motion characteristics that allow its motion to be effectively predicted from the motion of 
neighboring macroblocks, and it contains no non-zero quantized transform coefficients then it is flagged as skipped. 
This mode is identified as the Skip mode. Note that, unlike in prior standards, non-zero motion vectors can be inferred 
when using the Skip mode in P slices. 



- 12- 



1 macroblock partition of 
16*16 luma samples and 
associated chroma samples 




2 macroblock partitions of 
16*8 luma samples and 




2 macroblock partitions of 
8*16 luma samples and 
associated chroma samples 




4 sub-macroblocks of 
6*8 luma samples and 
associated chroma samples 



1 sub-macro block partition 
of 8*8 luma samples and 
associated chroma samples 



2 sub-macroblock partitions 
of 8*4 luma samples and 
associated chroma samples 



2 sub-macroblock partitions 
of 4*8 luma samples and 



Sub-macroblock 






4 sub-macroblock partitions 
of 4*4 luma samples and 



0 


1 


2 


3 



Fig. 6: Macroblock partitions for motion estimation and compensation 

In addition to the use of motion compensation and reference picture selection for prediction of the current picture 
content, weighted prediction can be used in P slices. When weighted prediction is used, customized weights can be 
applied as a scaling and offset to the motion-compensated prediction value prior to its use as a predictor for the current 
picture samples. Weighted prediction can be especially effective for such phenomena as "fade-in" and "fade-out" 
scenes. 

After the temporal prediction, the steps of Transform, Quantization, Scanning, and Entropy coding are conceptually the 
same as those for I-slices for the coding residual data (the original minus the predicted pixel values). The motion vectors 
and reference picture indexes representing the estimated motion are also compressed. To compress the motion vectors, a 
median of the motion vectors from the neighboring three sub-macroblocks - left, above and above right or left - is 
obtained and the difference from this median vector and the value of the current motion vector is retained and entropy 
coded. Similarly, the selected reference frame indexes are also entropy coded. 

2.3. B-Slices 

In B-slices, two motion vectors, representing two estimates of the motion, per sub-macroblock partitions are allowed for 
temporal prediction. They can be from any reference picture in future or past in display order. Again, a constraint on 
the number of reference pictures that can be used for motion estimation is specified in the Levels definition. A weighted 
average of the pixel values in the reference pictures is then used as the predictor for each sample. 

B-slices also have a special mode - Direct mode. In this mode the motion vectors for a macroblock are not explicitly 
sent. The receiver derives the motion vectors by scaling the motion vector of the co-located macroblock in another 
reference picture. In this case, the reference picture for the current macroblock is the same as that for the co-located 
macroblock. The motion vector scaling is performed to according to the temporal distances among the current picture, 
the picture containing the co-located macroblock and the reference picture of that co-located macroblock. 

The weighted prediction concept is further extended in the case of B slices. In addition to its use for scaling and 
offsetting a prediction value, in B slices weighted prediction can enable encoder adjustment of the weighting used in the 
weighted average between the two predictions that apply to bi-prediction. This can be especially effective for such 
phenomena as "cross-fades" between different video scenes, as the bi-prediction allows flexible weighted blending of 
content from such scenes. 

Unlike in prior standards, pictures coded using B slices can be used as references for the decoding of subsequent pictures 
in decoding order (with an arbitrary relationship to such pictures in display order). 
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2.4. SP and SI Slices 

Switching P (SP) and Switching I (SI) slices are close cousins of the usual P and 1 slices, utilizing either temporal or 
spatial prediction as before; however, their main virtue is that they can allow reconstruction of specific exact sample 
values, even when using different reference frames in the prediction process. The main usefulness of this property 
(which naturally comes at some cost in coding efficiency when compared to the usual P and I slices) is to allow 
bitstream switching, as well provide additional functionalities such as random access, fast forward, reverse, and stream 
splicing. These tools are only available in the so-called Extended Profile, and are not allowed in the more common 
Baseline or Main Profiles. It is unclear at this time what applications will target the Extended Profile, although these 
tools would be useful for streaming media applications. 

2.5. Deblocking Filter 

As shown in Fig. 1, H.264/AVC uses an in-loop deblocking filter to reduce the blockiness introduced in a picture. The 
filtered pictures are used to predict the motion for other pictures. The deblocking filter is an adaptive filter that adjusts 
its strength depending upon compression mode of a macroblock (Intra or Inter), the quantization parameter, motion 
vector, frame or field coding decision and the pixel values. For smaller quantization sizes the filter shuts itself off. This 
filter can also be shut-off explicitly or adjusted in overall strength by an encoder at the slice level. 

2.6. Error Resilience Tools 

A number of features of the codec design are designed to enable recovery of video fidelity in the presence of network 
transmission errors or losses. For example, the NAL design, with its highly robust treatment of sequence and picture 
header content (more properly called sequence and picture parameter sets in the standard), establishes a high degree of 
robustness. The basic slice structure design adds further robustness, as each slice is designed to completely independent 
of all other slices of a picture in the basic decoding process (prior to application of the deblocking filter, which can also 
be made independent by the encoder if desired). No content of any slice of a picture is used for the prediction of syntax 
elements or sample values used in the decoding process of other slices. Additionally, the encoder can select to indicate 
that the prediction of intra macroblock sample values in P and B slices will not use spatial neighbors that were not also 
coded in intra modes - adding further robustness against temporal error propagation. The multiple-reference picture 
support can also be used by an encoder to enable further resilience against data losses and errors (basically by avoiding 
the use of any pictures as reference pictures in the prediction process if the fidelity of those pictures may have been 
adversely affected by transmission errors or losses). 

Going beyond these basic feature that are an inherent part of the design, there are essentially four additional tools that are 
specified in the standard for further protecting the video bitstream from network transmission errors, which may occur 
for example as a result of congestion overloads on wired networks, or much more frequently due to various channel 
errors in wireless networks. These tools are: 1) Flexible Macroblock Order (FMO), 2) Arbitrary Slide Order (ASO), and 
3) Redundant Slices (RS), and 4) Data Partitioning (DP). FMO and ASO work to randomize the data prior to 
transmission, so that if a segment of data is lost (e.g. a packet, or several continuous packets), the errors are distributed 
more randomly over the video frames, rather than in a single block of data, making it more likely that relevant 
neighboring data is available for recovery of lost content. This helps to preserve more local information in all areas, at 
the cost of some randomly disbributed loss. Redundant Slices offer more protection by reducing the chance of loss via 
redundancy, a common approach at the level of channel coding. The additional tools of Data Partitioning and SP/SI 
slices, available in Extended Profile, are also valuable for error resilience/recovery. 

2.7. Color Space and Residual Color Transform Support 

Like spatial transforms, color transforms to date have generally used floating-point operations and have thus been prone 
to rounding errors. Typically, video is captured and displayed using the RGB (red, green, and blue) color space, but 
these components are typically highly correlated. Further, the human visual system seems better matched to luma 
(brightness) and chroma (hue and saturation) representations, rather than RGB. The usual approach has been to perform 
a color transformation such as RGB-to-YCbCr before compression, and then code the video in the YCbCr domain, as in: 



Y=K H * R + (\-K R -K B )*G + K B *B; Cb 



(B-Y) . 



Cr 



(R-Y) . 



2(1-*,)' 



2(1-**) ' 



with, e.g., K R = 0.2126, K B = 0.0722. 
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There are two problems with this approach. The first is that since the samples are actually represented using integers, 
rounding error is introduced in both the forward and inverse color transformations. The second is that, because the 
above transformation was not originally designed for digital video compression, it uses a sub-optimal trade-off between 
the complexity of the transformation (with difficult-to-implement coefficient values such as 0.2126 and 0.0722) and 
coding efficiency. Focusing on the second problem first, the FRExt amendment adds support for a new color space 
called YCgCo (where the "Cg" stands for green chroma and the "Co" stands for orange chroma), which is much simpler 
and typically has equal or better coding efficiency. It uses the following basic equations: 



While this reduces complexity (and may even improve coding efficiency), it does not, by itself, solve the problem of 
rounding error. However, rounding errors can be avoided if two additional bits of accuracy are used in the chroma, 

FRExt also goes further. Rather that adding two bits of precision to each sample, the FRExt amendment also includes a 
variant of this scheme which does not require adding precision to the luma samples and which adds only one bit of 
precision to the chroma samples and does not introduce any conversion rounding error. The equations for the forward 
transformation in that case are as follows: 

Co = R-B; t = B + {Co»\)\ Cg = G-t; r=f + (Cg»l); 

where t is an intermediate temporary variable and "»" denotes an arithmetic right shift operation. For the sake of 
completeness (so that the reader can check for themselves to see whether the equations are exact integer inverses of each 
other) we also provide the inverse transformation equations here as follows: 

f=r-(Cg»l); G = / + Cg; fl = /-(Co»l); R = B + Co. 

A 1-bit expansion of sample accuracy is still necessary to represent the transformed data in the YCgCo domain in this 
case, but FRExt has another work-around for this as well. The solution is to retain the use of the RGB domain (in which 
the sample depth is lower) for the input and output pictures and the stored reference pictures while bringing the above 
forward and inverse color transformations inside the encoder and decoder for the processing of the residual data only. 
This technique, called the residual color transform, eliminates color-space conversion error without significantly 
increasing the overall complexity of the system. Its only drawback is that it can only be applied to 4:4:4 video, as its 
operation depends on having both luma and chroma samples available for every sample location. 



In addition to basic coding tools, the H.264/AVC standard enables sending extra supplemental information along with 
the compressed video data. This ordinarily takes a form called "supplemental enhancement information" (SEI) in the 
standard. Such data is specified in a backward-compatible way, so that as new types of supplemental information are 
specified, they can even be used with profiles of the standard that had been previously specified before that definition. 
The first version of the standard includes the definition of a variety of such SEI data, which we will not specifically 
review herein. Instead we focus only on what new types of backward-compatible supplemental and auxiliary data are 
defined in the new FRExt amendment. These new types of data are as follows: 

♦ Auxiliary pictures, which are extra monochrome pictures sent along with the main video stream, and can be 
used for such purposes as alpha blend compositing (specified as a different category of data than SEI). 

♦ Film grain characteristics SEI, which allow a model of film grain statistics to be sent along with the video 
data, enabling an analysis-synthesis style of video enhancement wherein a synthesized film grain is 
generated as a post-process when decoding, rather than burdening the encoder with the representation of 
exact film grain during the encoding process. 

♦ Deblocking filter display preference SEI, which allows the encoder to indicate cases in which the pictures 
prior to the application of the deblocking filter process may be perceptually superior to the filtered pictures. 

♦ Stereo video SEI indicators, which allow the encoder to identify the use of the video on stereoscopic 
displays, with proper identification of which pictures are intended for viewing by each eye. 




3. SUPPLEMENTAL INFORMATION 
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4. PROFILES AND LEVELS 



4.1. The Baseline, Main, and Extended Profiles in the First Version 

H.264/AVC contains a rich set of video coding tools. Not all the coding tools are required for all the applications. For 
example, sophisticated error resilience tools are not important for the networks with very little data corruption or loss. 
Forcing every decoder to implement all the tools would make a decoder unnecessarily complex for some applications. 
Therefore, subsets of coding tools are defined; these subsets are called Profiles. A decoder may choose to implement 
only one subset (Profile) of tools, or choose to implement some or all profiles. The following three profiles were defined 
in the original standard, and remain unchanged in the latest version: 

♦ Baseline (BP) 

♦ Extended (XP) 

♦ Main (MP) 

Table 2 gives a high-level summary of the coding tools included in these profiles. The Baseline profile includes I and P- 
slices, some enhanced error resilience tools (FMO, ASO, and RS), and CAVLC. It does not contain B, SP and Si-slices, 
interlace coding tools or CABAC entropy coding. The Extended profile is a super-set of Baseline, and includes B, SP 
and Sl-slices and interlace coding tools, in addition to all of the Baseline Profile's coding tools and further error 
resilience support in the form of data partitioning (DP). It does not include CABAC. The Main profile includes I, P and 
B-slices, interlace coding tools, CAVLC and CABAC. It does not include enhanced error resilience tools (FMO, ASO, 
RS, and DP) or SP & Sl-slices. 

At this writing, the Baseline Profile appears to be the primary choice for videoconferencing applications. The Main 
profile, which received a great deal of initial implementation interest for entertainment-quality consumer applications, 
now seems to be waning in interest due to the new definition of the High profile in FRExt. 



Table 2: Profiles in Original H.264/AVC Standard 



Coding Tools 


Baseline 


Main 


Extended 


I and P Slices 


X 


X 


X 


CAVLC 


X 


X 


X 


CABAC 




X 




B Slices 




X 


X 


Interlaced Coding 




X 


X 


Enh. Error Resil. (FMO, ASO, RS) 


X 




X 


Further Enh. Error Resil (DP) 






X 


SP and SI Slices 






X 



4.2. The New High Profiles Defined in the FRExt Amendment 

The FRExt amendment defines four new profiles: 

♦ High (HP) 

♦ HighlO(HilOP) 

♦ High 4:2:2 (Hi422P) 

♦ High 4:4:4 (Hi444P) 

All four of these profiles build further upon the design of the prior Main profile, and they all include three enhancements 
of coding efficiency performance. 

♦ Adaptive macroblock-level switching between 8x8 and 4x4 transform block size 

♦ Encoder-specified perceptual-based quantization scaling matrices 

♦ Encoder-specified separate control of each chroma component quantization parameter 

All of these profiles also support monochrome coded video sequences, in addition to typical 4:2:0 video. The difference 
in capability among these profiles is primarily in terms of supported sample bit depths and chroma formats. However, 
the High 4:4:4 profile additionally supports the residual color transform and predictive lossless coding features not found 
in any other profiles. The detailed capabilities of these profiles are shown in Table 3. 
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Table 3: New Profiles in the H.264/AVC FRExt Amendment 



Coding Tools 


High 


High 10 


High 4:2:2 


High 4:4:4 


Main Prr\tfil*» TV*r»lc 
iviaiii r runic ILHJIa 


Y 
A 


X 


X 


X 


8x8 vs. 4x4 Transform Adaptivity 


X 


X 


X 


X 


Quantization Scaling Matrices 


X 


X 


X 


X 


Separate Cb and Cr QP control 


X 


X 


X 


X 


Monochrome video format 


X 


X 


X 


X 


9 and 10 Bit Sample Bit Depth 




X 


X 


X 


4:2:2 Chroma Format 






X 


X 


11 and 12 Bit Sample Bit Depth 








x ! 


4:4:4 Chroma Format 








X 


Residual Color Transform 








X 


Predictive Lossless Coding 








X 



As can be seen in the table, among these new profiles and the prior Main profile there is a neatly-nested "onion-like" 
structure of capabilities - with each "higher" profile also supporting all capabilities of the lower ones. Indeed, the 
standard also requires "higher" profiles to be capable of decoding all bitstreams encoded for the lower nested profiles. 
At the time of this writing, the High profile seems to be overtaking the Main profile as the primary choice for broadcast 
and other entertainment-quality applications, and High 4:2:2 is expected to be used frequently in studio environments. 
The key aspect making the High profile of such intense near-term interest is that it has very little (almost no) added 
implementation complexity relative to the prior Main profile, while improving compression capability in both subjective 
and objective terms (with quantization scaling matrices and transform block-size switching, respectively) and increasing 
encoder control flexibility (with support of separate quantization parameters for the two chroma components). 

4.3. Levels 

For real-time decoders or decoders with constrained memory size, it is important to specify the processing power and the 
memory size needed for implementation. Picture size plays the main role in influencing those parameters. As shown in 
Table 5, H.264/AVC defines 16 different Levels, tied mainly to the picture size. Levels also provide constraints on the 
number of reference pictures and the maximum compressed bit rate that can be used. The level identified as "lb" was 
added in the FRExt amendment, primarily to address the expressed needs of some 3G wireless environments. Because 
the FRExt profiles are specified for more demanding high-fidelity applications, the bit rate capabilities are increased for 
the FRExt profiles as shown in Table 4, which specifies multipliers for the fourth column of Table 5. 

Note: In the standard, levels specify the maximum frame size in terms of only the total number of pixels/frame. 
Horizontal and Vertical maximum sizes are not specified except for constraints that horizontal and vertical sizes can not 
be more than Sqrt(maximum frame size * 8). If, at a particular level, the picture size is less than the one in the table, 
then a larger number of reference pictures (up to 16 frames) can be used for motion estimation and compensation. 
Similarly, instead of specifying a maximum frame rate at each level, maximum sample (pixel) rate, in terms of 
macroblocks per second, is specified. If the picture size is smaller than the typical pictures size in Table 5, then the 
frame rate can be higher than that in Table 5, all the way up to a maximum of 1 72 frames/sec. 

Table 4: Compressed Bit Rate Multipliers for FRExt Profiles (see Table 5 column 4) 



FRExt Profile 


Bit Rate Multiplier 


High 


1.25 


High 10 


3 


High 4:2:2 


4 


High 4:4:4 


4 
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Table 5: Levels in H.264/AVC 



Level Number 


Typical Picture 
Size 


Typical frame rate 


Maximum 
compressed bit rate 

(forVCL)in 
Non-FRExt profiles 


Maximum number of 
reference frames for 
typical picture size 


1 


QCIF 


15 


64 kbps 


4 


lb 


QCIF 


15 


128 kbps 


4 


1.1 


CIF or QCIF 


7.5 (CIF) / 30 (QCIF) 


192 kbps 


2 (CIF)/ 9 (QCIF) 


1.2 


CIF 


15 


384 kbps 


6 


1.3 


CIF 


30 


768 kbps 


6 


2 


CIF 


30 


2 Mbps 


6 


2.1 


HHR(480i or 5760 


30/25 


4 Mbps 


6 


2.2 


SD 


15 


4 Mbps 


5 


3 


SD 


30/25 


10 Mbps 


5 


3.1 


1280x720p 


30 


14 Mbps 


5 


3.2 


1280x720p 


60 


20 Mbps 


4 


4 


HD Formats 
(720por 1080i) 


60p/30i 


20 Mbps 


4 


4.1 


HD Formats 
(720por 1080i) 


60p/30i 


50 Mbps 


4 


4.2 


1920xl080p 


60p 


50 Mbps 


4 


5 


2kxlk 


72 


135 Mbps 


5 


5.1 


2kxlkor4kx2k 


120/30 


240 Mbps 


5 



5. SIMULATION RESULTS 
5.1. Performance of the First Version of the H.264/AVC 

Fig. 7 shows some comparisons of the coding efficiency of MPEG-2, MPEG-4 Part 2 and MPEG-4 Part 10 
(H.264/AVC) for the original version of the H.264/AVC specification when tested on a couple of example video 
sequences. These test results are provided courtesy of the Advanced Technology group of Motorola BCS. In these 
simulations, no rate control was used and Rate-Distortion (R-D) curves corresponding to encoding with different 
standards are presented. These are example plots and the results will vary from one encoder to another and from one test 
video sequence to another. From these plots we see that MPEG-4 Part 2 Advanced Simple Profile (ASP), completed in 
1999, provided about 1.5 times coding gain over MPEG-2. And MPEG-4 Part 10 (H.264/AVC), as completed in 2003, 
provides about 2 times coding gain over MPEG-2 from Rate-Distortion point of view. Note that here the I-frame refresh 
rate is set at every 15 frames. Since advanced motion estimation is a key strength of H.264/AVC, these results actually 
underestimate its relative merits for broader applications not needing such high intra refresh rates. There are other 
aspects of the encoding methods used for H.264/AVC in these tests which are also well known to be sub-optimal 
(particularly its conventional use of non-reference B pictures), so these plots show a conservative estimate. 

For the original version of H.264/AVC, most tests seem to show about a 2:1 gain or better in coding efficiency for the 
new standard. For example, MPEG performed tests that it called "verification tests" in late 2003 [10], and those tests 
showed similar relative fidelity [II] when using testing methods further described in [12]. The tests done by MPEG 
measured subjective video quality, unlike what is shown in Fig. 7 (with subjective testing being a better test method, but 
more difficult to perform rigorously). 
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Fig. 7(a): M & C sequence at CIF (352x288) resolution 
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(b): Bus sequence at CIF resolution 
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Fig. 7(c): M & C sequence at HHR (352x480) resolution (d): Bus sequence at HHR resolution 
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Fig. 7(e): M & C sequence at BT.601 (720x480) resolution 

Fig. 7: (a) - (e) Comparison of R-D curves for MPEG-2 (MP2), MPEG-4 ASP (MP4 ASP) and 
H.264/AVC (MP4AVC). I frames were inserted every 15 frames (N=15) and two non-reference B 
frames per reference I or P frame were used (M=3). 
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5.2. Performance of FRExt High Profile 

As FRExt is still rather new, and as some of the benefit of FRExt is perceptual rather than objective, it is somewhat more 
difficult to measure its capability. One relevant data point is the result of a subjective quality evaluation done by the 
Blu-ray Disc Association (BDA). The summary results are reproduced in Fig. 8 below (with permission) from [9]. This 
test, conducted on 24 frame/sec film content with 1920x1080 progressive-scanning, shows the following nominal results 
(which may or may not be rigorously statistically proven): 

♦ The High profile of FRExt produced nominally better video quality than MPEG-2 when using only one- 
third as many bits (8 Mbps versus 24 Mbps) 

♦ The High profile of FRExt produced nominally transparent (i.e., difficult to distinguish from the original 
video without compression) video quality at only 16 Mbps. 

The quality bar (3.0), considered adequate for use on high-definition packaged media in this organization, was 
significantly surpassed using only 8 Mbps. Again also, there were well-known sub-optimalities in the H.264/AVC 
coding method used in these tests. Thus, the bit rate can likely be reduced significantly below 8 Mbps while remaining 
above the 3.0 quality bar establishing a quality sufficient to call acceptable "HD" in that demanding application. 

The result of an example objective (PSNR) comparison test performed by FastVDO is shown in Fig. 9. These objective 
results confirm the strong performance of the High profile. (Again, sub-optimal uses of B frames and other aspects 
make the plotted performance conservative for FRExt, thus the remark in the figure about potential future performance.) 
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Fig. 8: Perceptual Test of FRExt High Profile Capability by Blu-ray Disc Association [9] 




Fig. 9: Objective Performance of FRExt High Profile vs. MPEG-2 on a test 720p clip. These results are 
consistent with the subjective results in Fig. 8, i,e, 8 Mb/s of FRExt outperforms 20 Mb/s of MPEG-2. 
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6. CONCLUSIONS 

The new video standard known as H.264/AVC presents a rich collection of state-of-the-art video coding capabilities that 
can provide interoperable video broadcast or communication with degrees of capability that far surpass those of prior 
standards. With the new FRExt amendment, and especially the new High Profile, H.264/AVC further bolsters its 
position as the premier design for standardized video compression. We believe these technologies provide a hitherto 
unavailable set of cost/performance points that will have a powerful impact on both consumer and professional video 
applications in the years to come. 
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